On May 22, Zhipu (02513.HK) made waves in both the capital markets and the technology sector. As its stock surged more than 22% intraday and its market value held above 450 billion HKD, the company officially launched a major new product for enterprise customers: the GLM-5.1 Highspeed API (GLM-5.1-highspeed).

In actual tests the model reaches an impressive 400 tokens per second (tokens/s), surpassing the current top speed of official APIs from large model providers worldwide. At this speed, text that would take a writer days of continuous work can be generated in about a minute, and a system re-engineering task that would take engineers three days of typing can be completed in the time it takes to drink a cup of coffee.

Key Highlights:

  • Breaking conventions: The industry has traditionally assumed that "fast means small or lightweight." Zhipu is the first domestic large model provider to combine "full-scale flagship capabilities" with "extremely low latency."

  • Hardcore achievements: Output speed reaches 400 tokens/s, supporting a 200K long context window, with a maximum single output of 128K tokens.

  • Underlying black tech: Developed through deep collaboration between Zhipu's GLM team and TileRT team, it restructures the entire system-level inference ecosystem.

  • Targeted pilot test: Now available through Zhipu's MaaS (Model as a Service) open platform to select enterprise customers.
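
To put the headline figures in perspective, the arithmetic can be sketched in a few lines. This is a back-of-the-envelope calculation only: it assumes a constant decode rate of 400 tokens/s, reads "128K" as 128,000 tokens, and ignores time to first token and any queueing delay. The function and constant names are illustrative and not part of any Zhipu API.

```python
def generation_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a constant decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


SPEED = 400            # tokens/s, the figure quoted in the article
MAX_OUTPUT = 128_000   # assumption: "128K" read as 128,000 tokens

# Tokens produced in one minute at the quoted speed.
tokens_per_minute = SPEED * 60

# Time to emit one maximum-length response at the quoted speed.
full_output_seconds = generation_time_seconds(MAX_OUTPUT, SPEED)

print(tokens_per_minute)     # 24000
print(full_output_seconds)   # 320.0
```

Under these assumptions, one minute of output is 24,000 tokens, and a maximum-length response takes a little over five minutes to stream.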

How Smooth Is "Instant Response"? An Overwhelming Advantage in Speed-Sensitive Scenarios

Over the past year, the coding and agent collaboration capabilities of domestic large models have improved significantly, but speed has remained the core bottleneck for long-chain, high-frequency interaction tasks. Zhipu says that a 400 tokens/s experience is what turns large models from "tools" into "real-time partners":

  • AI Programming (Coding Agent): A traditional agent often needs dozens of cross-file calls and long-text alignment steps. If each round of response lags by a few seconds, the overall task can stretch to more than ten minutes. With the high-speed version, writing code feels like running at 10x speed: functions, interfaces, and underlying call chains unfold as fast as the user types, eliminating the wait in large-scale engineering rework.

  • Real-time interaction and 3D games: Extremely low latency lets the model handle real-time dynamic generation within game worlds and instant web UI construction, and update system states and interface feedback immediately, without lag, in response to continuous user input.

  • Business decision clusters: In multi-agent parallel simulation and real-time big data analysis scenarios, the high-speed version can complete the parallel responses of a complex web-based agent cluster with multiple personas within 30 seconds, significantly raising the efficiency ceiling for high-frequency quantitative analysis and simulation.

  • Seamless real-time voice: In AI coaching and smart customer service scenarios, ultra-fast response brings the latency between speech recognition (ASR) and speech synthesis (TTS) close to zero, delivering a conversation flow that feels natural, human-like, and on equal footing.
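
The coding-agent claim above (dozens of calls, each lagging a few seconds, adding up to more than ten minutes) can be illustrated with a simple latency model. Every number here except the 400 tokens/s figure is a hypothetical chosen for illustration: 40 rounds, 800 generated tokens per round, and a 50 tokens/s baseline are assumptions, not measurements from Zhipu.

```python
def agent_task_seconds(
    rounds: int,
    tokens_per_round: int,
    tokens_per_second: float,
    overhead_per_round: float = 0.0,
) -> float:
    """Total wall-clock time for a sequential multi-round agent task.

    Each round generates tokens_per_round tokens at tokens_per_second,
    plus a fixed per-round overhead (tool calls, network, and so on).
    """
    per_round = tokens_per_round / tokens_per_second + overhead_per_round
    return rounds * per_round


ROUNDS = 40              # hypothetical: "dozens of cross-file calls"
TOKENS_PER_ROUND = 800   # hypothetical output size per round

baseline = agent_task_seconds(ROUNDS, TOKENS_PER_ROUND, tokens_per_second=50)   # hypothetical baseline speed
highspeed = agent_task_seconds(ROUNDS, TOKENS_PER_ROUND, tokens_per_second=400)  # speed quoted in the article

print(baseline)              # 640.0 seconds, a little over ten minutes
print(highspeed)             # 80.0 seconds
print(baseline / highspeed)  # 8.0
```

Because the rounds run one after another, per-round generation time multiplies across the whole task, which is why decode speed dominates the experience in long-chain agent workflows.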

Decoding Three Layers of Black Tech: How Was 400 Tokens/s Achieved?

This global speed record is mainly the result of system-level engineering optimization jointly developed by Zhipu's GLM team and the TileRT team. The 400 tokens/s figure is not a flashy "peak moment" but a stable, usable, production-level capability. The underlying optimization is divided into three layers:

[Infrastructure Layer: Cluster and Load-Balancing Coordination] ─► [Scheduling System Layer: Dynamic Batching & KV Cache Scheduling] ─► [Inference Engine Layer: TileRT Rewrite of the Core Path] ─► 400 tokens/s Stable Output

  1. Inference Engine Layer (TileRT Deep Customization): Building on the unique network architecture of GLM-5.1, the team completely rewrote the most critical inference path and the underlying operators, pushing single-GPU throughput and hardware execution efficiency close to physical limits.

  2. Scheduling System Layer (Intelligent Merging): The team introduced highly aggressive dynamic batching, request merging, and groundbreaking KV cache scheduling optimization, effectively addressing the tail latency that traditional models often suffer under high concurrency and many simultaneous user requests.

  3. Infrastructure Layer (Cluster Collaboration): The team carried out comprehensive hardware-level co-optimization of the inference cluster's network deployment, link topology, and ultra-high-frequency load balancing, so that computing power is delivered along the entire pipeline without loss.
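
For readers unfamiliar with the scheduling-layer idea, dynamic batching in its simplest form can be sketched as follows. This is a generic textbook illustration, not Zhipu's or TileRT's scheduler, whose internals the article does not describe. The `Request` class, the `form_batches` function, and the `max_batch_size` and `max_wait_ms` parameters are all hypothetical names; production LLM serving systems typically batch at the level of individual decode steps rather than whole requests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: str
    arrival_ms: float  # arrival time in milliseconds


def form_batches(
    requests: List[Request], max_batch_size: int, max_wait_ms: float
) -> List[List[Request]]:
    """Group requests into batches in arrival order.

    A batch is closed when it is full, or when the next request arrives
    more than max_wait_ms after the first request in the batch.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        batch_full = len(current) >= max_batch_size
        waited_too_long = bool(current) and (
            req.arrival_ms - current[0].arrival_ms > max_wait_ms
        )
        if current and (batch_full or waited_too_long):
            batches.append(current)
            current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival pattern: a burst, a short lull, then stragglers.
arrivals = [0, 1, 2, 3, 4, 30, 31, 90]
requests = [Request(f"r{i}", t) for i, t in enumerate(arrivals)]
batches = form_batches(requests, max_batch_size=4, max_wait_ms=10)

print([[r.request_id for r in b] for b in batches])
# [['r0', 'r1', 'r2', 'r3'], ['r4'], ['r5', 'r6'], ['r7']]
```

The two limits express the trade-off the article alludes to: a larger batch improves GPU utilization, while the wait cap bounds how long any single request can be held back, which is what keeps tail latency in check.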

Industry Reassessment: The Second Half of AI Is Settled in "Value and Time"

As top-tier international research institutions such as UBS emphasized at recent Hong Kong stock technology forums, this round of AI-driven industry reassessment is fundamentally different from the "traffic and time monetization" of the mobile internet era. The revenue and survival logic of AI is not to keep users trapped in software, but to "help users and enterprises save time, improve efficiency, and share in the value created."

Zhipu's GLM-5.1 Highspeed version speaks directly to this point. By compressing the cost and time of producing each token to a fraction of the original, it means enterprises no longer have to make a painful compromise between "high intelligence" (a large model that is slow) and "speed" (a small model that is dull).