New techniques are chipping away at the performance bottlenecks of large language model (LLM) inference. Recently, a research team from Moonshot AI and Tsinghua University jointly proposed a new architecture called **Prefill-as-a-Service (PrfaaS)**. The work targets the hardware constraints of serving large models in data centers: by optimizing how computing resources are allocated, it significantly improves inference efficiency.

Technical Breakthrough: "Surgical" Separation of Prefill and Decode
Today, inference for large language models consists of two distinct phases:
Prefill Phase: Compute-intensive; it processes the entire input prompt and builds the key-value cache (KVCache).
Decode Phase: Memory-bandwidth-intensive; it generates the output token by token, re-reading the KVCache at every step.
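The asymmetry between the two phases can be sketched in a toy example (this is illustrative only, not the PrfaaS implementation; the function names and the list-based "cache" are stand-ins for real attention kernels and K/V tensors):

```python
def prefill(prompt_tokens):
    # Compute-bound: one parallel pass over all prompt tokens,
    # producing the KV cache that decode will repeatedly read.
    kv_cache = [("kv", t) for t in prompt_tokens]  # stand-in for K/V tensors
    return kv_cache

def decode(kv_cache, max_new_tokens):
    # Memory-bandwidth-bound: each step reads the whole cache
    # to emit a single token, then appends that token's K/V entry.
    output = []
    for _ in range(max_new_tokens):
        next_token = len(kv_cache)  # placeholder for attention + sampling
        output.append(next_token)
        kv_cache.append(("kv", next_token))  # cache grows one entry per step
    return output

cache = prefill([101, 102, 103])  # one big batch-parallel pass
print(decode(cache, 3))           # → [3, 4, 5], one sequential step per token
```

The key point the sketch captures: prefill does a lot of arithmetic once, while decode does a little arithmetic many times over a growing cache, which is why the two phases saturate different hardware resources.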
In traditional serving architectures, both phases run in the same data center, often on the same server. Because their hardware demands differ, this forced bundling frequently leaves compute and bandwidth allocation mismatched, resulting in service congestion.
Core Innovation: Efficient Collaboration Across Regions
PrfaaS removes this physical co-location constraint, allowing prefill and decode to run simultaneously in different data centers. To keep cross-region transmission efficient, PrfaaS introduces a two-time-scale scheduling mechanism: it flexibly reallocates resources as traffic fluctuates and, combined with precise routing, ensures that long-text requests are not delayed by uneven resource distribution in transit.
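The idea of scheduling on two time scales can be sketched as follows. This is a minimal toy, not the paper's algorithm: the class and method names are invented for illustration, and the "slow" loop simply re-partitions GPUs between prefill and decode pools in proportion to observed traffic, while the "fast" loop routes each request to the least-loaded worker in its pool.

```python
class TwoTimeScaleScheduler:
    """Toy two-time-scale scheduler (illustrative names, not from PrfaaS):
    a slow loop re-partitions capacity between the prefill and decode
    pools; a fast loop routes individual requests within a pool."""

    def __init__(self, total_gpus=8):
        self.total_gpus = total_gpus
        self.prefill_gpus = total_gpus // 2
        self.decode_gpus = total_gpus - self.prefill_gpus
        self.prefill_load = [0] * self.prefill_gpus
        self.decode_load = [0] * self.decode_gpus

    def rebalance(self, avg_prompt_tokens, avg_output_tokens):
        # Slow time scale: split GPUs in proportion to where the work is
        # (long prompts -> more prefill GPUs; long outputs -> more decode
        # GPUs), keeping at least one GPU in each pool.
        ratio = avg_prompt_tokens / (avg_prompt_tokens + avg_output_tokens)
        self.prefill_gpus = min(self.total_gpus - 1,
                                max(1, round(self.total_gpus * ratio)))
        self.decode_gpus = self.total_gpus - self.prefill_gpus
        self.prefill_load = [0] * self.prefill_gpus
        self.decode_load = [0] * self.decode_gpus

    def route(self, pool, cost):
        # Fast time scale: per-request routing to the least-loaded worker,
        # so a burst of long prompts does not pile up on one machine.
        loads = self.prefill_load if pool == "prefill" else self.decode_load
        idx = loads.index(min(loads))
        loads[idx] += cost
        return idx
```

For example, after observing traffic with ~3000-token prompts and ~100-token outputs, `rebalance(3000, 100)` would shift most of an 8-GPU cluster into the prefill pool, while `route("prefill", cost)` spreads incoming requests across that pool request by request.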
Test Results: Dual Optimization of Throughput and Latency
Research data shows that the PrfaaS architecture performs remarkably well in practice:
Service throughput increased by 54%, substantially raising the number of requests handled per unit time.
Response latency dropped markedly, so the first token arrives faster from the user's perspective.
Resource utilization improved: decoupling the compute, network, and storage subsystems avoids the congestion problems of traditional architectures.
The collaboration between Moonshot AI and Tsinghua University not only offers new engineering approaches for large-scale AI inference but also lays a technical foundation for future cross-region computing networks. This "Prefill-as-a-Service" model may mark an important turning point as large models move toward industrial deployment.
