AI inference efficiency may be about to see a new breakthrough. On June 28, Peking University and DeepSeek jointly announced and open-sourced DSpark, an inference acceleration framework for large language models. It targets the response latency and wasted compute caused by frequent forward passes in high-concurrency inference scenarios.

In standard autoregressive generation, a large language model runs a full forward pass for every output token, which directly limits real-time response speed in conversations. Speculative decoding, in which a lightweight draft model proposes several candidate tokens that the target model then verifies, is currently the mainstream way to speed this up. Traditional approaches have clear shortcomings, however: simple draft models generate candidates sequentially and take too long, while parallel draft models often see their candidate acceptance rate fall on long sequences, so a large share of compute is wasted.
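The draft-then-verify idea behind speculative decoding can be shown with a toy example. This is a minimal sketch of the generic greedy variant, not DSpark's implementation: the `Model` type, `speculative_step`, and the two toy models are hypothetical names introduced here for illustration.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to the next token
# (greedy decoding). A real model would return logits instead.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens accepted."""
    # 1) The cheap draft model proposes k candidate tokens.
    candidates: List[int] = []
    for _ in range(k):
        candidates.append(draft(prefix + candidates))

    # 2) The target model checks the candidates. A real system scores all
    #    k positions in one batched forward pass; this toy calls it per position.
    accepted: List[int] = []
    for tok in candidates:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every candidate matched, so the target adds one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(seq: List[int]) -> int:
    # Toy rule: count upward modulo 10.
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except when the correct token is 5.
    nxt = (seq[-1] + 1) % 10
    return 0 if nxt == 5 else nxt


out = speculative_step(toy_target, toy_draft, [1], k=4)
print(out)  # [2, 3, 4, 5]
```

Here three draft tokens are accepted and the fourth is rejected and replaced by the target's token, so one verification round yields four tokens. The lower the acceptance rate, the more of the drafting and verification work is thrown away, which is the waste the article describes.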


To address these pain points, DSpark introduces two optimizations. In the candidate generation phase, it adopts a semi-autoregressive architecture: a parallel backbone network outputs high-quality base features in a single pass, and a lightweight module then refines the logical coherence of the text. With only two Transformer layers, it outperforms a five-layer parallel model, balancing speed and quality. In the verification phase, the framework introduces confidence-based scheduling, in which a hardware-aware prefix scheduler dynamically assesses the computing load and prioritizes the most reliable text segments, minimizing unnecessary computation.
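One way to picture confidence-based prefix scheduling is as a rule that decides how many leading draft tokens are worth verifying. The article does not describe DSpark's actual rule, so the sketch below rests on assumptions: the function `choose_verify_length`, the threshold `min_prefix_prob`, and the `budget` parameter (a stand-in for the hardware load assessment) are all hypothetical.

```python
from typing import List


def choose_verify_length(confidences: List[float], min_prefix_prob: float, budget: int) -> int:
    """Return how many leading draft tokens to send for verification.

    confidences[i] is the draft model's estimated probability that token i
    is accepted, given that all earlier tokens were accepted.
    budget caps the length (assumed here to reflect current hardware load).
    """
    prefix_prob = 1.0  # estimated chance that the whole prefix is accepted
    length = 0
    for p in confidences[:budget]:
        prefix_prob *= p
        if prefix_prob < min_prefix_prob:
            break  # tokens past this point are unlikely to survive verification
        length += 1
    return length


n = choose_verify_length([0.9, 0.8, 0.5, 0.9], min_prefix_prob=0.5, budget=8)
print(n)  # 2
```

The first two tokens have a joint acceptance estimate of 0.72, while adding the third drops it to 0.36, so under this assumed rule only the first two are verified and the rest are skipped.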

In tests on mainstream models such as Tongyi Qianwen 3 (Qwen3) and Gemma4, across scenarios including code writing, mathematical reasoning, and everyday conversation, DSpark delivered strong results. Compared with two mainstream baselines, Eagle3 and DFlash, it shows a clear advantage in effective generation length per round. The gain is most visible in long-sequence generation, where it significantly alleviates the decline in candidate acceptance rate.

On the engineering side, the development team carried out system-level optimizations: sequence packing to reduce memory consumption, an asynchronous scheduling mode to eliminate GPU pipeline stalls, and compatibility with the mainstream CUDA hardware ecosystem. DSpark has already been deployed in the preview service engines for DeepSeek-V4-Flash and DeepSeek-V4-Pro. Test data shows a significant gain in overall system throughput regardless of the response-speed target applied.
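Sequence packing, mentioned above, can be illustrated in a few lines. This sketch shows the general technique only and makes no claim about DSpark's internals: the helpers `pack` and `padded_size` are hypothetical names.

```python
from typing import List, Tuple


def pack(sequences: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate sequences into one flat buffer plus boundary offsets."""
    flat: List[int] = []
    offsets: List[int] = [0]
    for seq in sequences:
        flat.extend(seq)
        offsets.append(len(flat))  # sequence i occupies flat[offsets[i]:offsets[i + 1]]
    return flat, offsets


def padded_size(sequences: List[List[int]]) -> int:
    """Token slots needed if every sequence is padded to the longest one."""
    return len(sequences) * max(len(s) for s in sequences)


batch = [[1, 2, 3, 4, 5], [6], [7, 8]]
flat, offsets = pack(batch)
print(len(flat), padded_size(batch))  # 8 15
print(offsets)                        # [0, 5, 6, 8]
```

For this batch, padding would allocate 15 token slots while the packed layout needs 8, which is where the memory saving comes from. The offsets let each sequence be recovered from the flat buffer.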

DeepSeek has reportedly open-sourced the complete training code, model weights, and evaluation tools for DSpark, DFlash, and Eagle3 in the DeepSpec project on GitHub. The release is expected to substantially lower the cost of deploying high-performance inference services and to offer a practical technical template for making large models affordable at scale.