As large language models (LLMs) demand ever more computational resources during inference, traditional serving architectures face bottlenecks. Moonshot AI and a research team from Tsinghua University have recently introduced a new architecture called Pre-filling as a Service (PrfaaS), designed to break through the data-center and compute limitations of large language model serving.


Currently, large language model inference is typically divided into two stages: pre-filling and decoding. The pre-filling stage is compute-intensive: the model processes the entire input prompt and builds a key-value cache (KVCache). The decoding stage is memory-bandwidth-intensive: the model generates output tokens one at a time, attending over the KVCache at each step. Traditional architectures require both stages to run within the same data center, which constrains both compute capacity and bandwidth.
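The two-stage split can be illustrated with a minimal sketch (a hypothetical toy, not Moonshot's implementation; the tuple entries stand in for per-token attention key/value tensors):

```python
def prefill(prompt_tokens):
    """Compute-bound phase: process all prompt tokens in one batch,
    producing one key/value entry per token -- the KVCache."""
    return [(f"K_{t}", f"V_{t}") for t in prompt_tokens]

def decode_step(kv_cache):
    """Memory-bandwidth-bound phase: attend over the whole KVCache to emit
    a single token, then append that token's own K/V entry."""
    next_token = f"tok_{len(kv_cache)}"  # placeholder for the sampled token
    kv_cache.append((f"K_{next_token}", f"V_{next_token}"))
    return next_token

cache = prefill(["the", "cat", "sat"])   # one pass over the full prompt
outputs = [decode_step(cache) for _ in range(2)]  # one token per step
```

Note how the cache grows by exactly one entry per decode step, while prefill populates it for the whole prompt at once; this asymmetry is what makes the two phases suit different hardware.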

PrfaaS enables efficient serving across data centers by offloading pre-filling tasks to dedicated high-compute clusters and transmitting the generated KVCache over commodity Ethernet to local decoding clusters. The team reports that this architecture significantly improves performance, with a 54% increase in serving throughput over traditional designs, and practical case studies show lower latency and higher efficiency as well.
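The core idea of the split can be sketched as follows (function names and the serialization choice are illustrative assumptions; a real system would ship tensors over RDMA or TCP rather than pickled Python objects):

```python
import pickle

def prefill_remote(prompt_tokens):
    """Runs on the remote high-compute cluster: build the KVCache for the
    full prompt and serialize it into a byte payload for network transfer."""
    kv_cache = {i: (f"K_{tok}", f"V_{tok}") for i, tok in enumerate(prompt_tokens)}
    return pickle.dumps(kv_cache)  # stand-in for the cross-data-center payload

def decode_local(kv_bytes):
    """Runs on the local decode cluster: deserialize the received KVCache
    and continue token-by-token generation from it."""
    kv_cache = pickle.loads(kv_bytes)
    return len(kv_cache)  # decoding would attend over these cached entries

payload = prefill_remote(["a", "b", "c", "d"])
entries = decode_local(payload)  # 4 cached entries, one per prompt token
```

The design works because the KVCache is the only state the decode stage needs from prefill, so a one-way bulk transfer over ordinary Ethernet suffices.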

The PrfaaS design separates the three major subsystems -- computing, networking, and storage -- and manages them independently. A precise routing mechanism ensures that long requests are dispatched efficiently, avoiding the congestion that uneven resource allocation causes in traditional designs. The system also introduces a dual-timescale scheduling mechanism to adapt to shifting traffic patterns, further improving resource utilization.
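A length-aware router of this kind might look like the following sketch (the threshold, queue bound, and function name are all assumptions for illustration, not PrfaaS internals):

```python
# Assumed tunable cutoff; in a dual-timescale scheme this would be
# re-tuned on the slow timescale as traffic patterns drift.
LONG_PROMPT_THRESHOLD = 2048

def route(prompt_len, remote_queue_depth, max_queue=32):
    """Fast-timescale, per-request decision: send long prompts (whose
    prefill is compute-heavy) to the remote high-compute cluster, unless
    its queue is congested; everything else stays local."""
    if prompt_len >= LONG_PROMPT_THRESHOLD and remote_queue_depth < max_queue:
        return "remote_prefill"
    return "local"

route(4096, remote_queue_depth=3)   # long prompt, remote cluster free -> remote
route(128, remote_queue_depth=3)    # short prompt -> local
route(4096, remote_queue_depth=64)  # remote congested -> fall back to local
```

The two timescales separate concerns: per-request routing reacts in microseconds, while threshold and capacity adjustments track traffic shifts over minutes.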

With growing demand for cross-data-center inference and the continued arrival of new hardware, PrfaaS offers a promising new direction for future AI serving.