Microsoft recently announced that its Azure ND GB300 v6 virtual machine achieved a new industry record of 1.1 million tokens per second for inference on Meta's Llama 2 70B model. Microsoft CEO Satya Nadella stated on social media: "This achievement is the result of our long-term collaboration with NVIDIA and our expertise in running artificial intelligence at production scale."


The Azure ND GB300 virtual machine uses NVIDIA's Blackwell Ultra GPU, specifically the NVIDIA GB300 NVL72 system, a rack-scale design that combines 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs into what behaves as a single unified system. The virtual machine is optimized for inference workloads, featuring a 50% increase in GPU memory and a 16% increase in thermal design power (TDP) over the previous generation.

To verify the performance improvements, Microsoft ran the Llama 2 70B model (at FP4 precision) on 18 ND GB300 v6 virtual machines within a single NVIDIA GB300 NVL72 domain, using NVIDIA TensorRT-LLM as the inference engine. Microsoft stated: "Azure ND GB300 v6 in an NVL72 rack achieved a total inference speed of 1.1 million tokens per second." This surpassed Microsoft's previous record of 865,000 tokens per second on an NVIDIA GB200 NVL72 rack.

Given the system configuration, this works out to roughly 15,200 tokens per second per GPU. Microsoft also published the detailed benchmarking procedure along with all log files and results. The performance record has been verified by Signal65, an independent performance validation and benchmarking firm.
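The per-GPU figure follows directly from dividing the aggregate throughput by the GPU count of the rack. A minimal back-of-the-envelope check (the variable names are illustrative, not from Microsoft's benchmark scripts):

```python
# Sanity-check the per-GPU throughput cited in the article.
total_tokens_per_sec = 1_100_000  # aggregate rack throughput reported by Microsoft
num_gpus = 72                     # Blackwell Ultra GPUs in one GB300 NVL72 rack

per_gpu = total_tokens_per_sec / num_gpus
print(f"{per_gpu:,.0f} tokens/s per GPU")  # ≈ 15,278, in line with the ~15,200 cited
```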

Russ Fellows, Vice President of Labs at Signal65, noted in a blog post: "This milestone not only broke through the barrier of one million tokens per second, but also achieved it on a platform that meets the dynamic usage and data governance needs of modern enterprises." He added that Azure ND GB300 improved inference performance by 27% over the previous-generation NVIDIA GB200 while increasing power draw by only 17%. Compared to the NVIDIA H100, the GB300 delivered nearly ten times the inference performance and roughly 2.5 times the rack-level power efficiency.
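The 27% figure is consistent with the two throughput records quoted earlier, and combining it with the 17% power increase gives a rough performance-per-watt gain over the GB200. A quick hedged check (derived from the article's numbers, not an independent measurement):

```python
# Cross-check the generational comparison from the reported throughput figures.
gb300_tokens = 1_100_000  # ND GB300 v6 record
gb200_tokens = 865_000    # prior ND GB200 v6 record

speedup = gb300_tokens / gb200_tokens - 1
print(f"throughput gain: {speedup:.0%}")  # ≈ 27%, matching the quoted figure

# With ~17% more power, perf-per-watt improves by roughly:
power_increase = 0.17
perf_per_watt_gain = (1 + speedup) / (1 + power_increase) - 1
print(f"perf/W gain vs GB200: {perf_per_watt_gain:.0%}")  # ≈ 9%
```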

Key Points:   

🚀 The Microsoft Azure ND GB300 v6 virtual machine achieved an inference speed of 1.1 million tokens per second, setting a new industry record.   

💻 The underlying GB300 NVL72 rack combines 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs, optimized for inference.   

📈 Azure ND GB300 improved inference performance by 27% over the previous-generation GB200, and delivered nearly 2.5 times the rack-level power efficiency of the H100.