New MLPerf Inference v4.1 Benchmark Results Highlight Rapid Hardware and Software Innovations in Generative AI Systems

New mixture of experts benchmark tracks emerging architectures for AI models

Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v4.1 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. This release includes first-time results for a new benchmark based on a mixture of experts (MoE) model architecture. It also presents new findings on power consumption related to inference execution.

MLPerf Inference v4.1 

The MLPerf Inference benchmark suite, which encompasses both data center and edge systems, is designed to measure how quickly hardware systems can run AI and ML models across a variety of deployment scenarios. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems. 
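
Every MLPerf Inference test is driven by a common load generator (LoadGen) that issues queries according to the chosen scenario. As a rough sketch of how a submission plugs into it, the following uses the mlperf_loadgen Python bindings built from the MLCommons inference repository, with no-op handlers standing in for a real model; the handler bodies and sample counts here are illustrative, not any submitter's actual harness.

```python
import mlperf_loadgen as lg

def issue_query(query_samples):
    # A real system-under-test would run the model here; this stub
    # immediately reports an empty response for each query sample.
    responses = [lg.QuerySampleResponse(s.id, 0, 0) for s in query_samples]
    lg.QuerySamplesComplete(responses)

def flush_queries():
    pass  # drain any batched work; nothing to do in this stub

def load_samples(indices):
    pass  # a real QSL would stage dataset samples into memory

def unload_samples(indices):
    pass

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline      # one of the MLPerf scenarios
settings.mode = lg.TestMode.PerformanceOnly

sut = lg.ConstructSUT(issue_query, flush_queries)
qsl = lg.ConstructQSL(1024, 128, load_samples, unload_samples)  # illustrative counts
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```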

The benchmark results for this round demonstrate broad industry participation and include the debut of six newly available or soon-to-be-shipped processors:

○ AMD MI300X accelerator (available) 

○ AMD EPYC “Turin” CPU (preview) 

○ Google “Trillium” TPUv6e accelerator (preview) 

○ Intel “Granite Rapids” Xeon CPUs (preview) 

○ NVIDIA “Blackwell” B200 accelerator (preview) 

○ UntetherAI SpeedAI 240 Slim (available) and SpeedAI 240 (preview) accelerators 

MLPerf Inference v4.1 includes 964 performance results from 22 submitting organizations: AMD, ASUSTek, Cisco Systems, Connect Tech Inc, CTuning Foundation, Dell Technologies, Fujitsu, Giga Computing, Google Cloud, Hewlett Packard Enterprise, Intel, Juniper Networks, KRAI, Lenovo, Neural Magic, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Supermicro, Sustainable Metal Cloud, and Untether AI.

“There is now more choice than ever in AI system technologies, and it’s heartening to see providers embracing the need for open, transparent performance benchmarks to help stakeholders evaluate their technologies,” said Mitchelle Rasquinha, MLCommons Inference working group co-chair. 

New mixture of experts benchmark 

Keeping pace with today’s ever-changing AI landscape, MLPerf Inference v4.1 introduces a new benchmark to the suite: mixture of experts. MoE is an architectural design for AI models that departs from the traditional approach of employing a single, massive model; it instead uses a collection of smaller “expert” models. Inference queries are directed to a subset of the expert models to generate results. Research and industry leaders have found that this approach can match the accuracy of a single monolithic model while often delivering a significant performance advantage, because only a fraction of the parameters are invoked with each query.
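
To make the routing idea concrete, the toy NumPy sketch below implements top-2 gating over eight small experts, the same general pattern that Mixtral-style MoE layers follow; all dimensions, weights, and names here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# One tiny linear "expert" per slot; a real MoE layer uses full MLP experts.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x):
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_w                          # (n_tokens, n_experts)
    chosen = np.argsort(-logits, axis=-1)[:, :top_k]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, chosen[t]]                 # softmax over selected experts only
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()
        for w, e in zip(weights, chosen[t]):
            out[t] += w * (x[t] @ experts[e])      # only top_k experts ever run
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 16): same shape, but ~top_k/n_experts of the compute
```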

The MoE benchmark is unlike any other in the suite and is one of the most complex that MLCommons has implemented to date. It uses the open-source Mixtral 8x7B model as a reference implementation and performs inference using datasets covering three independent tasks: general Q&A, solving math problems, and code generation. 
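
For readers who want to experiment, here is a minimal sketch of querying the reference model directly with the Hugging Face transformers library; the model id matches the public Mixtral release, but the prompt and generation settings are illustrative and are not the official MLPerf harness (which drives the model through LoadGen across the three task datasets).

```python
# Requires transformers and accelerate; the full 8x7B model needs
# substantial GPU memory, so this is a sketch rather than a recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# An example prompt in the spirit of the code-generation task.
prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```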

“When determining to add a new benchmark, the MLPerf Inference working group observed that many key players in the AI ecosystem are strongly embracing MoE as part of their strategy,” said Miro Hodak, MLCommons Inference working group co-chair. “Building an industry-standard benchmark for measuring system performance on MoE models is essential to address this trend in AI adoption. We’re proud to be the first AI benchmark suite to include MoE tests to fill this critical information gap.” 

Benchmarking Power Consumption 

The MLPerf Inference v4.1 benchmark includes 31 power consumption test results across three submitted systems, covering both datacenter and edge scenarios. These results demonstrate the continued importance of understanding the power requirements of AI systems running inference tasks, as power costs are a substantial portion of the overall expense of operating AI systems. 
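
Power submissions pair measured system power with the same throughput metrics as the performance runs, so systems can be compared on performance per watt. A trivial illustration of that comparison follows; all figures are invented, not actual v4.1 results.

```python
# Illustrative efficiency comparison; numbers are made up for the example.
systems = {
    "system_a": {"samples_per_sec": 12000.0, "avg_watts": 3200.0},
    "system_b": {"samples_per_sec": 9000.0,  "avg_watts": 2100.0},
}
for name, m in systems.items():
    efficiency = m["samples_per_sec"] / m["avg_watts"]  # samples/s per watt
    print(f"{name}: {efficiency:.2f} samples/s/W")
```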

The Increasing Pace of AI Innovation 

Today, we are witnessing an incredible groundswell of technological advances across the AI ecosystem, driven by a wide range of providers including AI pioneers; large, well-established technology companies; and small startups. 

MLCommons would especially like to welcome first-time MLPerf Inference submitters AMD and Sustainable Metal Cloud, as well as Untether AI, which delivered both performance and power efficiency results. 

“It’s encouraging to see the breadth of technical diversity in the systems submitted to the MLPerf Inference benchmark as vendors adopt new techniques for optimizing system performance such as vLLM and sparsity-aware inference,” said David Kanter, Head of MLPerf at MLCommons.

“Farther down the technology stack, we were struck by the substantial increase in unique accelerator technologies submitted to the benchmark this time. We are excited to see that systems are now evolving at a much faster pace – at every layer – to meet the needs of AI. We are delighted to be a trusted provider of open, fair, and transparent benchmarks that help stakeholders get the data they need to make sense of the fast pace of AI innovation and drive the industry forward.” 
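
As a point of reference for one of the optimization techniques Kanter mentions, here is a minimal offline-generation sketch using vLLM's public Python API; the model id is a placeholder reusing the benchmark's reference model, and this is not any submitter's actual serving stack.

```python
from vllm import LLM, SamplingParams

# vLLM batches requests through a continuous-batching engine with paged
# KV-cache memory, which is what makes it attractive for LLM serving.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is a mixture-of-experts model?"], params)
print(outputs[0].outputs[0].text)
```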

View the Results 

To view the results for MLPerf Inference v4.1, visit the MLCommons website.
