Tsinghua University's Storage Lab and Tencent's MegEngine AI Infra team recently announced that they took first place worldwide in the MoE Model Inference Optimization Challenge at MLSys 2026, a top-tier international conference on machine learning systems.


To address the inference bottlenecks of mixture-of-experts (MoE) models with trillion-scale parameters on heterogeneous chips (NPUs), the joint team designed an end-to-end optimization scheme for the official model and NPU hardware. At the operator level, the scheme combines four techniques: the E-Shard strategy, which partitions work by expert; batched readout of PSUM as a 3D tensor; a GEMV path that scatters outputs across multiple banks for concurrent processing; and use of the scalar engines to reduce initial data-transfer latency. Together, these measures addressed low data-transfer efficiency and repeated activation transfers.
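The announcement does not spell out how E-Shard works internally, so the following is only a minimal sketch of the general idea of partitioning work by expert. It assumes that each shard owns a contiguous range of experts and that tokens are grouped per expert before processing, so that each expert's inputs are handled in one batch rather than moved repeatedly. The function names `shard_experts` and `group_tokens_by_shard` are hypothetical and do not come from the team's code.

```python
from collections import defaultdict


def shard_experts(num_experts, num_shards):
    """Assign contiguous expert ranges to shards, as evenly as possible.

    Hypothetical helper: the real E-Shard assignment policy is not described
    in the announcement.
    """
    base, extra = divmod(num_experts, num_shards)
    shards, start = [], 0
    for shard_id in range(num_shards):
        # The first `extra` shards take one additional expert.
        size = base + (1 if shard_id < extra else 0)
        shards.append(list(range(start, start + size)))
        start += size
    return shards


def group_tokens_by_shard(token_to_expert, shards):
    """Group token indices by owning shard, then by expert.

    Each expert's tokens end up in a single list, so a shard can process
    one expert at a time on a whole batch of tokens.
    """
    owner = {e: s for s, experts in enumerate(shards) for e in experts}
    work = defaultdict(lambda: defaultdict(list))
    for token, expert in enumerate(token_to_expert):
        work[owner[expert]][expert].append(token)
    return {s: dict(per_expert) for s, per_expert in work.items()}


# Example: 8 experts over 3 shards, 6 tokens with top-1 routing (assumed).
shards = shard_experts(num_experts=8, num_shards=3)
plan = group_tokens_by_shard([0, 5, 5, 7, 2, 0], shards)
print(shards)
print(plan)
```

In this toy setup, tokens 0 and 5 both route to expert 0 and are batched together on the same shard. This illustrates only the grouping idea, not the team's actual kernel.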

For the attention module, the team restructured the on-chip data layout and integrated key Transformer operators, while keeping numerical results aligned at bit-level precision.


Figure 3: Schematic diagram of the MoE optimization structure, including E-Shard expert partitioning, continuous DMA, PSUM/GEMV concurrency, cold start pipeline, and prefetch control.

During the competition, the team also jointly developed "Knight," an agent-based optimizer for inference operators. Running an automated closed loop of proposal, code implementation, and iteration, it significantly expanded the optimization search space. In the end, the solution cut the model's end-to-end inference time from 14.91 s to 3.56 s, a speedup of about 4.2x. Single-step decoding latency dropped from 12.63 ms to 5.45 ms, and DMA engine utilization during weight loading rose to approximately 80%.
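The reported timings can be checked with a simple ratio. The short sketch below uses only the four figures quoted above; the helper name `speedup` is introduced here for illustration.

```python
def speedup(before, after):
    """Return how many times faster `after` is compared with `before`."""
    return before / after


# Timings quoted in the announcement.
e2e = speedup(14.91, 3.56)     # end-to-end inference time, seconds
decode = speedup(12.63, 5.45)  # single-step decoding latency, milliseconds

print(f"end-to-end: {e2e:.2f}x, decode step: {decode:.2f}x")
```

The end-to-end ratio works out to roughly 4.19, consistent with the roughly fourfold speedup reported, while the per-step decoding latency improves by a factor of about 2.32.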

The joint team finished ahead of entries from leading international universities such as Stanford and MIT. The win demonstrates the deep expertise of Chinese teams in adapting large models to underlying systems and optimizing operators, and it offers a valuable engineering reference for deploying trillion-parameter MoE models on future super-node computing platforms.