Nvidia, Qualcomm Shine in MLPerf Inference; Intel’s Sapphire Rapids Makes an Appearance.
The steady maturation of MLCommons/MLPerf as an AI benchmarking tool was apparent in today’s release of MLPerf v2.1 Inference results. Twenty-one organizations submitted 5,300 performance results and 2,400 power measurements. While Nvidia again dominated, showcasing its new H100 GPU, Qualcomm impressed on both performance and power consumption metrics. Intel ran a Sapphire Rapids CPU-based system, albeit in the preview category. AI start-ups Biren, Moffett AI, Neural Magic, and Sapeon were among the entrants.
In some ways, making sense of MLPerf results has grown more complicated as the diversity of submitters, system sizes, and system configurations has grown. But that is sort of the goal: MLPerf’s extensive benchmark suites and divisions (closed, open, etc.) permit more granular comparisons for system evaluators, though it takes some effort. Moreover, MLCommons (MLPerf’s parent organization) keeps evolving its test suites to stay current and relevant. In this, the sixth set of MLPerf inference results, a new division (Inference over the network) and a new object detection model (RetinaNet) were added.
While training is regarded as a more HPC-related activity, inference is the enterprise workhorse. Depending on the application scenario, latency may or may not be an issue, accuracy is always an issue, and power consumption, particularly for cloud and edge applications, is critical. The MLPerf inference suite, broadly, has provisions for all of these characteristics. Fortunately, MLCommons has made parsing the results fairly easy (links to the results are at the end of the article).
As has become MLPerf’s practice, vendors are allowed to submit statements describing the systems used in the MLPerf exercise and any special features that may assist their AI-related performance. Of note, at least one participant – Neural Magic – is a software company. The companies’ statements, as provided by MLCommons, are included at the end of the full article.
Nvidia Still Rules the Roost
Nvidia is the only company that has run all of the tests in the closed division in every MLPerf round. This time it showcased its new H100 GPU in the preview category (products available within six months) and its A100 in the closed division. The H100 was up to 4.5x faster than A100-based systems.
David Salvator, director of AI inference, benchmarking, and cloud at Nvidia, said the big gains were made possible by leveraging Nvidia’s transformer engine, which uses FP8 where possible. “Basically [it’s] a technology that makes intelligent use of the FP8 precision. When you go to lower precision, one of the concerns is rounding errors introduce error and inaccuracy in the answers that a neural network is giving you back.” The software-defined transformer engine is able to choose where to use FP8. “The H100 should really be thought of now as our current generation product,” he said.
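To make the rounding-error concern concrete, here is a minimal sketch (not Nvidia’s transformer engine, just plain NumPy) that mimics quantizing values to a reduced-mantissa, FP8-like grid and measures the relative error introduced. The `quantize_e4m3_like` function and its parameters are illustrative assumptions; real FP8 formats also constrain the exponent range.

```python
# Illustrative only: approximate the rounding effect of an FP8-style E4M3 format
# by truncating the mantissa; real hardware FP8 also limits exponent range.
import numpy as np

def quantize_e4m3_like(x, mantissa_bits=3):
    """Round values to a reduced-mantissa grid (a rough stand-in for FP8 E4M3)."""
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nonzero = x != 0
    m, e = np.frexp(x[nonzero])            # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)
    out[nonzero] = np.round(m * scale) / scale * (2.0 ** e)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    activations = rng.normal(size=10_000)
    q = quantize_e4m3_like(activations)
    rel_err = np.abs(q - activations) / np.maximum(np.abs(activations), 1e-12)
    print(f"mean relative rounding error: {rel_err.mean():.4%}")
```

The point of Salvator’s “intelligent use” comment is that this kind of error is acceptable in some layers and not in others, which is why the engine selects where FP8 is applied rather than casting the whole network.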
Not that the wildly successful A100 is going away. It swept all the datacenter closed division metrics and, said Salvator, can even be used at the edge: “The A100 has that ability to be deployed at the edge, particularly our PCIe-based version, which can be put into a one-rack server that can be deployed at the edge if you need a very performant platform at the edge to be doing inference and even some training work.”
Qualcomm’s AI 100 Continues to Impress
Qualcomm followed up its strong showing in the spring MLPerf Inference round (v2.0) with another solid set of results. The company’s Cloud AI 100 accelerator is gaining traction and, of course, its Snapdragon processor has long been a mobile industry leader. The AI 100 is mostly used in one of two versions – Pro (400 TOPS) or Standard (300 TOPS) – and features miserly power use.
John Kehrli, a senior director of product management at Qualcomm, reiterated the company’s strength in low-power devices: “Our story continues here in terms of power efficiency. In this particular [MLPerf] cycle, my perspective is what you’ll really see is showcasing our performance across a number of commercial platforms. You’ll see new submissions from Dell, HP, Lenovo, and Thundercomm. You’ll see a new variant from Inventec as well; Foxconn Gloria is doing a higher submission as well. And [you’ll] see submissions with both Standard and Pro.”
Again, it is necessary to compare systems side-by-side to get a clear picture of strengths and weaknesses. The AI 100 was a strong performer at the edge, including the best single-stream performance for ResNet-50, and it performed well in the multi-stream scenario as well. The company also pulled together some impressive queries-per-second-per-watt analyses.
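The efficiency metric referenced here is straightforward; the minimal sketch below, using made-up numbers rather than Qualcomm’s actual figures, shows how such a figure is derived from a throughput result and a measured average system power.

```python
# Hedged illustration with hypothetical values, not any vendor's submitted results.
def perf_per_watt(samples_per_second: float, avg_power_watts: float) -> float:
    """Throughput divided by average measured wall power gives an efficiency figure
    useful for comparing systems of very different sizes."""
    return samples_per_second / avg_power_watts

# Example: a hypothetical system doing 20,000 ResNet-50 samples/s at 400 W of wall power.
print(f"{perf_per_watt(20_000, 400):.1f} samples/s per watt")  # -> 50.0
```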
Analyst Karl Freund wrote, “The fact that both Gloria (Foxconn) and Inventec Heimdall, suppliers to cloud service companies, submitted results to MLCommons tells us that Qualcomm may be realizing traction in the Asian cloud market, while Dell, Lenovo, and HPE support indicates global interest in the Qualcomm part for datacenter and ‘Edge Clouds.’
“The current Cloud AI 100 demonstrates dramatically superior efficiency for image processing compared to both cloud and edge competitors. This makes sense as the heritage of the accelerator is the high-end Snapdragon mobile processor’s Qualcomm AI Engine, which provides AI for mobile handsets where imaging is the primary application. Nonetheless, the Qualcomm platform provides best-in-class performance efficiency for the BERT-99 model, used in natural language processing.”
Good to See Growing Participation
It was interesting to see the Intel Sapphire Rapids-based development system (1-node-2S-SPR-PyTorch-INT8, 2 CPUs) running PyTorch with INT8. The preview category is for products that will be available within six months.
Intel’s Jordan Plawner, senior director of AI products, said, “We’re not releasing our core count and other product specifics. Obviously, there’ll be a range of core counts. This was performed on final silicon, but it is still what I would call a pre-production software stack. So we expect improvements as we optimize the software between now and launch, and the launch date has not yet been announced.”
Several relative newcomers to AI acceleration submitted entries, and in addition to looking at their results it may be useful to read their statements (included in the compilation of vendor statements at the end of the full article).
One hot topic at the formal MLCommons/MLPerf press and analyst briefing was so-called sparsity and its use to speed performance. There are, of course, many approaches. Sparsity can, however, spoil the apples-to-apples nature of submissions in the closed division.
Nvidia’s Salvator noted, “Some submitters, I believe, have actually done sparse submissions in the open category as a way to sort of show their abilities to work with sparse data sets. As you probably know, [when] you sparsify a [network], you typically have to do a recalibration on the training side [to use sparsity], right. In today’s MLPerf inference on the closed side, we don’t really have a provision for that yet. It’s something, of course, we’re always looking at.
“One of the important things about the closed category is that everybody has to do the same math. There are certain optimizations, shortcuts, whatever you want to call them, that can be taken involving things like sparsity and involving pruning, where you take parts of the network that aren’t really being lit up and basically just trim them away because they’re not being used. Those are optimizations that can be valid. But for the purpose of comparison, and making sure that everyone has done the same work, the closed category does not allow those,” he said.
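For readers unfamiliar with the optimization Salvator describes, here is a generic magnitude-pruning sketch in NumPy. It is not any submitter’s method; it simply illustrates the idea of trimming away weights that are not “lit up,” leaving a sparse network whose zeros can be skipped at inference time. The `magnitude_prune` helper and the 50% sparsity level are illustrative assumptions.

```python
# Generic magnitude pruning: zero out the smallest-magnitude weights.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero the smallest-magnitude fraction of weights (e.g. 0.5 -> ~50% sparse)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.normal(size=(256, 256))
    w_sparse = magnitude_prune(w, sparsity=0.5)
    print(f"nonzero weights remaining: {np.count_nonzero(w_sparse) / w.size:.0%}")
```

In practice, as Salvator notes, a pruned or sparsified network typically needs recalibration on the training side to recover accuracy, which is part of why the closed division, where everyone must do the same math, does not permit such shortcuts.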
Stay tuned.
Link to MLPerf Inference Datacenter v2.1 results: https://mlcommons.org/en/inference-datacenter-21/
Link to MLPerf Inference Edge v2.1 results: https://mlcommons.org/en/inference-edge-21/
Link to Nvidia blog: https://blogs.nvidia.com/blog/2022/09/08/hopper-mlperf-inference/
Continue reading the full article, including participating vendor statements, at sister publication, HPCwire.