Large model inference is reshaping AI infrastructure, and network architecture innovation has become a key path to unlocking hardware potential. In September 2025, Zhipu, Yuchen Network, and Tsinghua University published their research on the ZCube network architecture at ACM SIGCOMM 2025, a top conference in the networking field.

On May 21, 2026, Zhipu announced that the architecture had been deployed in the GLM-5.1 coding production environment, with significant performance gains. In benchmark tests with the GPUs, software stack, and applications unchanged, the ZCube architecture reduced capital expenditure on switches and optical modules by 33%, increased average GPU inference throughput by 15%, and reduced P99 time to first token (TTFT) by 40.6%, a system-level result that combines lower cost with higher performance.
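As a rough illustration of how the first two reported figures combine, the sketch below computes switch and optical-module capital expenditure per unit of inference throughput, relative to the baseline. It assumes the GPU count is unchanged and that the 33% and 15% figures apply to the same cluster. The function name is hypothetical, and the result covers network capex only, not total cluster cost.

```python
def relative_network_capex_per_throughput(capex_reduction, throughput_gain):
    """Network capex per unit of throughput, relative to a baseline of 1.0.

    Assumption: same number of GPUs before and after, so total throughput
    scales with the reported per-GPU average.
    """
    return (1.0 - capex_reduction) / (1.0 + throughput_gain)


# Figures reported in the announcement: 33% lower capex, 15% higher throughput.
ratio = relative_network_capex_per_throughput(0.33, 0.15)
print(f"{ratio:.3f}")  # 0.583
```

Under these assumptions, network capex per unit of throughput falls to roughly 58% of the baseline, a reduction of about 42%.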

Currently, as long-context inference and Prefill-Decode (PD) disaggregated deployment have become industry standard, cross-node KV Cache transfers are highly asymmetric. Traditional ROFT (Rail-Optimized Fat-Tree) architectures, built by stacking multiple layers of switches, are limited by their static topology, which makes them prone to local hotspots and PFC (Priority Flow Control) backpressure. The result is a structural bottleneck characterized by "sufficient total bandwidth but frequent local congestion."

To address this pain point, the ZCube architecture breaks away from the hierarchical stacking of the traditional Clos architecture. It eliminates the Spine-layer switches and interconnects two groups of completely flat switches as a bipartite graph, combined with a hybrid single-rail/multi-rail access mechanism for dual-port NICs. With its own routing strategy, ZCube ensures that every GPU pair has a dedicated optimal path, achieving perfect traffic load balancing at the structural level and supporting scale-out to tens or even hundreds of thousands of GPUs.
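The idea of two flat switch groups joined as a bipartite graph can be pictured with a toy model. The sketch below is not the published ZCube design or its routing strategy. It assumes a complete bipartite interconnect between a group "A" and a group "B", one dual-port NIC per (A, B) switch pair, and a simple hypothetical routing rule. All names (`attach_nics`, `switch_links`, `route`) are illustrative.

```python
from itertools import product


def attach_nics(num_a, num_b):
    """Toy assumption: one dual-port NIC per (A switch, B switch) pair."""
    return {
        f"nic{a}_{b}": (f"A{a}", f"B{b}")
        for a, b in product(range(num_a), range(num_b))
    }


def switch_links(num_a, num_b):
    """Toy assumption: every A switch links to every B switch (no Spine layer)."""
    return {(f"A{a}", f"B{b}") for a, b in product(range(num_a), range(num_b))}


def route(src, dst, nics, links):
    """Hypothetical routing rule: use a shared switch if one exists,
    otherwise leave via the source's A port and arrive via the
    destination's B port."""
    a_src, b_src = nics[src]
    a_dst, b_dst = nics[dst]
    if a_src == a_dst:
        return [src, a_src, dst]
    if b_src == b_dst:
        return [src, b_src, dst]
    assert (a_src, b_dst) in links  # guaranteed by the complete bipartite wiring
    return [src, a_src, b_dst, dst]


nics = attach_nics(3, 4)
links = switch_links(3, 4)
print(route("nic0_0", "nic2_3", nics, links))
```

In this toy model every NIC pair is connected through at most two switches, which illustrates how a flat two-group fabric can avoid a third switching tier. How ZCube actually assigns paths and balances load is described in the SIGCOMM paper, not here.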

During the production retrofit, the Yuchen Network team used automated control and verification tools to overcome the challenges of recabling and rebuilding the routing strategy, ensuring a fast and stable cluster upgrade. The thousand-GPU cluster has now been running stably for more than two weeks. ZCube's successful deployment marks a shift in intelligent computing infrastructure from general-purpose interconnection to system-level co-design driven by model traffic. Going forward, the deep integration of network topology, communication libraries, and scheduling strategies will become the core driver for further improving token production efficiency and reducing the overall cost of MaaS.