New techniques are chipping away at the performance bottlenecks of large language model (LLM) inference. Recently, a research team from Moonshot AI and Tsinghua University jointly proposed a new architecture called **Prefill-as-a-Service (PrfaaS)**. The work targets the hardware constraints of serving large models in data centers: by optimizing how computing resources are allocated, it significantly improves inference efficiency.

Technical Breakthrough: "Surgical" Separation of Prefill and Decode
Today, inference for large language models consists of two distinct phases:
Prefill Phase: Compute-intensive; it processes the entire input prompt and builds the key-value cache (KVCache).
Decode Phase: Memory-bandwidth-intensive; it generates the output token by token, re-reading the KVCache at every step.
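The asymmetry between the two phases can be sketched in a toy example (this is illustrative only, not the PrfaaS implementation; the function names and the list-based "cache" are stand-ins for real attention kernels and K/V tensors):

```python
def prefill(prompt_tokens):
    # Compute-bound: one parallel pass over all prompt tokens,
    # producing the KV cache that decode will repeatedly read.
    kv_cache = [("kv", t) for t in prompt_tokens]  # stand-in for K/V tensors
    return kv_cache

def decode(kv_cache, max_new_tokens):
    # Memory-bandwidth-bound: each step reads the whole cache
    # to emit a single token, then appends that token's K/V entry.
    output = []
    for _ in range(max_new_tokens):
        next_token = len(kv_cache)  # placeholder for attention + sampling
        output.append(next_token)
        kv_cache.append(("kv", next_token))  # cache grows one entry per step
    return output

cache = prefill([101, 102, 103])  # one big batch-parallel pass
print(decode(cache, 3))           # → [3, 4, 5], one sequential step per token
```

The key point the sketch captures: prefill does a lot of arithmetic once, while decode does a little arithmetic many times over a growing cache, which is why the two phases saturate different hardware resources.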
In traditional serving architectures, both phases run in the same data center, often on the same server. Because their hardware demands differ, this forced bundling frequently leaves compute and bandwidth allocation mismatched, resulting in service congestion.
Core Innovation: Efficient Collaboration Across Regions
PrfaaS removes this physical co-location constraint, allowing prefill and decode to run simultaneously in different data centers. To keep cross-region transmission efficient, PrfaaS introduces a two-time-scale scheduling mechanism: it flexibly reallocates resources as traffic fluctuates and, combined with precise routing, ensures that long-text requests are not delayed by uneven resource distribution in transit.
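The idea of scheduling on two time scales can be sketched as follows. This is a minimal toy, not the paper's algorithm: the class and method names are invented for illustration, and the "slow" loop simply re-partitions GPUs between prefill and decode pools in proportion to observed traffic, while the "fast" loop routes each request to the least-loaded worker in its pool.

```python
class TwoTimeScaleScheduler:
    """Toy two-time-scale scheduler (illustrative names, not from PrfaaS):
    a slow loop re-partitions capacity between the prefill and decode
    pools; a fast loop routes individual requests within a pool."""

    def __init__(self, total_gpus=8):
        self.total_gpus = total_gpus
        self.prefill_gpus = total_gpus // 2
        self.decode_gpus = total_gpus - self.prefill_gpus
        self.prefill_load = [0] * self.prefill_gpus
        self.decode_load = [0] * self.decode_gpus

    def rebalance(self, avg_prompt_tokens, avg_output_tokens):
        # Slow time scale: split GPUs in proportion to where the work is
        # (long prompts -> more prefill GPUs; long outputs -> more decode
        # GPUs), keeping at least one GPU in each pool.
        ratio = avg_prompt_tokens / (avg_prompt_tokens + avg_output_tokens)
        self.prefill_gpus = min(self.total_gpus - 1,
                                max(1, round(self.total_gpus * ratio)))
        self.decode_gpus = self.total_gpus - self.prefill_gpus
        self.prefill_load = [0] * self.prefill_gpus
        self.decode_load = [0] * self.decode_gpus

    def route(self, pool, cost):
        # Fast time scale: per-request routing to the least-loaded worker,
        # so a burst of long prompts does not pile up on one machine.
        loads = self.prefill_load if pool == "prefill" else self.decode_load
        idx = loads.index(min(loads))
        loads[idx] += cost
        return idx
```

For example, after observing traffic with ~3000-token prompts and ~100-token outputs, `rebalance(3000, 100)` would shift most of an 8-GPU cluster into the prefill pool, while `route("prefill", cost)` spreads incoming requests across that pool request by request.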
Test Results: Dual Optimization of Throughput and Latency
Research data shows that the PrfaaS architecture performs remarkably well in practice:
Service throughput increased by 54%, substantially raising the number of requests handled per unit time.
Response latency dropped markedly, so the first token arrives faster from the user's perspective.
Resource utilization improved: decoupling the compute, network, and storage subsystems avoids the congestion problems of traditional architectures.
The collaboration between Moonshot AI and Tsinghua University not only offers new engineering approaches for large-scale AI inference but also lays a technical foundation for future cross-region computing networks. This "Prefill-as-a-Service" model may mark an important turning point as large models move toward industrial deployment.
