Tsinghua University and Tencent Hunyuan Win the MLSys 2026 MoE Inference Challenge with a 4.1x Speedup on NPU

Tsinghua University's Storage Lab and Tencent's MegEngine AI Infra team recently announced that they won the global championship in the MoE Model Inference Optimization Challenge at MLSys 2026, a top-tier international conference on machine learning systems.

To address the inference bottlenecks of trillion-parameter mixture-of-experts (MoE) models on heterogeneous chips (NPUs), the joint team designed an end-to-end optimization scheme for the official model and NPU hardware. The scheme combines four techniques: the E-Shard strategy, which partitions work by expert; batched readout of 3D tensors from PSUM; a GEMV path that scatters outputs across multiple banks for concurrent processing; and use of the scalar engines to reduce initial data transfer latency. Together, these overcame low data transfer efficiency and repeated activation transfers at the operator level.
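
The idea of partitioning work by expert can be illustrated with a minimal sketch. This is not the team's E-Shard implementation, whose details the announcement does not give; the function name shard_by_expert, the round-robin assignment of experts to shards, and the sample routing are all assumptions made for illustration.

```python
from collections import defaultdict


def shard_by_expert(token_to_expert, num_shards):
    """Group token indices by their routed expert, then place each expert on a shard.

    Hypothetical placement rule (an assumption, not from the source):
    expert e is assigned to shard e % num_shards.
    """
    tokens_per_expert = defaultdict(list)
    for token_idx, expert_id in enumerate(token_to_expert):
        tokens_per_expert[expert_id].append(token_idx)

    # One dict per shard: expert id -> indices of the tokens it must process.
    shards = [dict() for _ in range(num_shards)]
    for expert_id, token_indices in sorted(tokens_per_expert.items()):
        shards[expert_id % num_shards][expert_id] = token_indices
    return shards


# Illustrative routing: token i was routed to expert routing[i].
routing = [0, 2, 1, 0, 3, 2]
plan = shard_by_expert(routing, num_shards=2)
for shard_id, experts in enumerate(plan):
    print(shard_id, experts)
```

With this kind of grouping, each shard only needs the weights of its own experts and sees each token's activations once per expert, which is one plausible way to avoid the repeated activation transfers mentioned above.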

For the attention module, the team restructured the on-chip data layout and fused key Transformer operators while maintaining bit-level precision alignment.

Figure 3: Schematic diagram of the MoE optimization structure, including E-Shard expert partitioning, continuous DMA, PSUM/GEMV concurrency, cold start pipeline, and prefetch control.

For this competition, the team also jointly developed "Knight," an agent-based optimizer for inference operators. Through an automated closed loop of proposal, code implementation, and iteration, it significantly expanded the optimization search space. The final solution cut the model's end-to-end inference time from 14.91 s to 3.56 s, a 4.1x speedup; single-step decoding latency dropped from 12.63 ms to 5.45 ms, and DMA engine utilization during weight loading rose to approximately 80%.
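
The reported figures can be checked with a few lines of arithmetic. The sketch below uses only the timings quoted above; the helper names speedup and reduction_pct are illustrative. The end-to-end ratio works out to roughly 4.19, which the announcement states as 4.1x.

```python
def speedup(baseline, optimized):
    """Ratio of baseline time to optimized time (higher is better)."""
    return baseline / optimized


def reduction_pct(baseline, optimized):
    """Percentage of the baseline time that was eliminated."""
    return 100.0 * (baseline - optimized) / baseline


# Timings quoted in the announcement.
end_to_end = speedup(14.91, 3.56)    # seconds
decode_step = speedup(12.63, 5.45)   # milliseconds

print(f"end-to-end speedup:  {end_to_end:.2f}x")
print(f"decode-step speedup: {decode_step:.2f}x")
print(f"end-to-end time cut: {reduction_pct(14.91, 3.56):.1f}%")
print(f"decode latency cut:  {reduction_pct(12.63, 5.45):.1f}%")
```

The per-step decoding gain (about 2.3x) is smaller than the end-to-end gain, which suggests that a good part of the improvement comes from phases other than steady-state decoding.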

The team beat entries from top international universities such as Stanford and MIT. The win demonstrates the expertise of Chinese teams in adapting large models to underlying systems and in operator optimization, and it offers a valuable engineering reference for deploying trillion-parameter MoE models on future super-node computing platforms.

High Demand Strains Computing Power: WorkBuddy Completes Emergency Upgrade for Tencent Hunyuan Hy3 Model

The joint WorkBuddy and Tencent Hunyuan project team announced that strong demand following the initial launch of the Hunyuan Hy3 large model caused peak congestion of computing resources, with users experiencing queues since 10:00 a.m. on July 8. The team has completed an emergency expansion and rescheduling of computing capacity to keep the service running stably.

AI Audio Editing Enters a New Era: Tencent Hunyuan Collaborates with Leading Institutions to Release the MMAE Benchmark; Current Model Precision in Audio Editing Is Less Than 5%

Tencent Hunyuan, in collaboration with Shanghai Jiao Tong University, Nanyang Technological University (Singapore), Tianjin University, Peking University, and Fudan University, has released MMAE, the first benchmark dataset for general instruction-driven audio editing. The benchmark exposes the current limits of AI audio editing, fills a gap in the audio generation field, and provides an evaluation standard for multi-task audio editing research.

Tencent Open-Sources Multilingual Translation Tool Hy-MT2, Lightweight Version Only 440MB for Local Execution, Mini Program Now Available

Tencent Hunyuan recently open-sourced the multilingual translation model Hy-MT2 and launched the "Tencent Hy Translation" mini program. The model family comes in three sizes and supports mutual translation among 33 languages plus five ethnic minority languages and dialects. The lightweight Hy-MT2-1.8B uses AngelSlim, Tencent's self-developed 1.25-bit extreme quantization technology, and is optimized for mobile devices to balance quality and efficiency.

Tencent Hunyuan Launches Its First Industrial-Level 2Bit Edge Model: Achieving a Performance Turnaround with a 0.3B Parameter Size

Tencent Hunyuan has released the ultra-small model HY-1.8B-2Bit, which uses an industrial-level 2-bit quantization scheme to reduce the equivalent parameter count to 0.3B. Its memory usage is approximately 600MB, smaller than some mobile applications. The release addresses the significant precision loss typical of low-bit quantization and offers a new approach to deploying large models efficiently on consumer-grade hardware.
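
A rough back-of-the-envelope calculation shows how these figures relate. The sketch below assumes 1 MB = 10^6 bytes and that "equivalent parameter count" means the number of 16-bit parameters that would fit in the same memory; both are assumptions, as the announcement does not define its terms. The helper names are illustrative.

```python
def packed_weight_bytes(num_params, bits_per_param):
    """Bytes needed to store the weights alone at a given bit width."""
    return num_params * bits_per_param / 8


def equivalent_fp16_params(memory_bytes):
    """How many 16-bit (2-byte) parameters fit in the given memory."""
    return memory_bytes / 2


PARAMS = 1_800_000_000            # 1.8B parameters, from the model name
REPORTED_MEMORY = 600_000_000     # ~600MB, as reported

weights_2bit = packed_weight_bytes(PARAMS, 2)
equiv = equivalent_fp16_params(REPORTED_MEMORY)

print(f"2-bit packed weights: {weights_2bit / 1e6:.0f} MB")
print(f"16-bit parameters in 600MB: {equiv / 1e9:.1f}B")
```

Under these assumptions, the packed 2-bit weights account for about 450MB of the reported 600MB, and 600MB corresponds to 0.3B parameters at 16 bits, consistent with the stated equivalent size. The estimate covers weights only; what makes up the remaining memory is not stated in the announcement.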

Talent Battle: AI Star Peng Tianyu from Tsinghua University Joins Tencent Hunyuan to Lead Multi-modal RL Research

AI expert Peng Tianyu has joined Tencent Hunyuan as Chief Research Scientist and Head of Multi-modal Reinforcement Learning Technology, responsible for building a top-tier team to tackle cutting-edge challenges in multi-modal generation and understanding. Peng Tianyu is a direct-entry Ph.D. student from the Department of Computer Science at Tsinghua University, advised by Professor Zhu Jun.