Just weeks after dropping a bombshell in the open-source model field, Google has once again given its strongest open-source model, Gemma 4, a powerful boost. On May 5th local time, Google officially released a Multi-Token Prediction (MTP) draft generator for the Gemma 4 series. Built on a speculative decoding architecture, the technique increases the model's inference speed by up to three times without sacrificing output quality or reasoning ability.
As one of the most closely watched open-source models in the world, Gemma 4 surpassed 60 million downloads shortly after its release. The core goal of this update is to address the long-standing inference bottleneck in large language models and extract still more efficiency from existing computing resources.
Technical Breakdown: How Does "Look-Ahead" Acceleration Work?
Traditional language model inference is often limited by memory bandwidth. Simply put, when the processor generates text, most of the time goes into moving tens of billions of parameters from memory to the compute units. This transfer is far slower than the arithmetic itself, leaving the hardware idle most of the time and producing noticeable response delays.
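To make the bottleneck concrete, here is a back-of-the-envelope estimate in Python. The bandwidth and quantization figures are illustrative assumptions rather than numbers from Google's announcement; the point is that every generated token requires re-reading the entire weight set, which caps throughput no matter how fast the compute units are.

```python
# Rough estimate of the memory-bandwidth ceiling on token-by-token decoding.
# All hardware figures below are illustrative assumptions.

params = 26e9           # a Gemma 4 26B-class model
bytes_per_param = 0.5   # assuming 4-bit quantized weights
bandwidth = 400e9       # assumed memory bandwidth, bytes/s (high-end laptop SoC)

weight_bytes = params * bytes_per_param      # ~13 GB read per generated token
ceiling = bandwidth / weight_bytes           # upper bound on tokens per second

print(f"Weights moved per token: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth-bound ceiling: {ceiling:.1f} tokens/s")  # ~30.8 tokens/s
```

At roughly 30 tokens per second, the accelerator's arithmetic units sit idle for most of each step; that idle headroom is exactly what speculative decoding exploits.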
To tackle this pain point, Google introduced speculative decoding. Its working principle can be understood as a draft-and-verify pairing: the system couples a heavy target model such as Gemma 4 31B with a lightweight MTP drafter. The drafter uses idle compute to predict several upcoming tokens in advance, and the more powerful main model then verifies them in parallel. Wherever the predictions match, the model confirms the whole run of tokens in a single forward pass, sharply reducing the time needed to generate text.
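The control flow can be sketched in a few lines of Python. This is a minimal greedy-acceptance sketch of the general speculative decoding idea, not Google's implementation: `draft_model`, `target_model`, and their methods are hypothetical stand-ins, and production systems use a probabilistic acceptance rule that provably preserves the target model's output distribution.

```python
# Minimal draft-and-verify loop. `draft_model` and `target_model` are
# hypothetical stand-ins with hypothetical APIs, shown for illustration only.

def speculative_decode(target_model, draft_model, prompt_ids, k=4, max_new=256):
    ids = list(prompt_ids)
    while len(ids) < len(prompt_ids) + max_new:
        # 1. The cheap drafter proposes k candidate tokens, one at a time.
        draft, ctx = [], list(ids)
        for _ in range(k):
            tok = draft_model.next_token(ctx)   # hypothetical method
            draft.append(tok)
            ctx.append(tok)

        # 2. The target model scores all k+1 positions in ONE forward pass
        #    (hypothetical method returning its prediction at each position),
        #    instead of k+1 sequential passes.
        verified = target_model.predict_parallel(ids, draft)

        # 3. Accept the longest prefix where drafter and target agree,
        #    then take the target's own token at the first mismatch.
        n = 0
        while n < k and draft[n] == verified[n]:
            n += 1
        ids.extend(draft[:n])
        ids.append(verified[n])
    return ids
```

In the best case all k drafted tokens are accepted and k+1 tokens are confirmed for the cost of a single pass through the large model; in the worst case the loop degrades gracefully to ordinary one-token-at-a-time decoding.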
Test Results: Apple Silicon and Consumer Graphics Cards Benefit Significantly
According to official test data, the acceleration is particularly pronounced on local devices. On Apple Silicon, with batch sizes set between 4 and 8, the Gemma 4 26B model ran locally about 2.2 times faster.
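A reported 2.2x gain is consistent with the standard expected-acceptance analysis from the speculative decoding literature. The sketch below uses that formula with illustrative acceptance rates, not measured values, and ignores the drafter's own (small) overhead.

```python
# Expected tokens confirmed per target-model pass, given per-token
# acceptance probability a and k drafted tokens per step:
# (1 - a**(k + 1)) / (1 - a). Acceptance rates here are assumptions.

def expected_tokens_per_pass(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.7, 0.8):
    print(f"acceptance {a:.1f}: ~{expected_tokens_per_pass(a, k=4):.2f} tokens/pass")
# acceptance 0.6: ~2.31 tokens/pass
# acceptance 0.7: ~2.77 tokens/pass
# acceptance 0.8: ~3.36 tokens/pass
```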
This means developers can now run complex offline coding assistants or agent workflows more smoothly on personal computers and common consumer graphics cards. The improved inference efficiency also cuts the power consumption of edge devices significantly, clearing the way for mobile AI applications to spread.
The Boundaries of AI Applications Are Expanding Again
This technical update mainly targets scenarios with stringent latency requirements, such as real-time chatbots, automated programming tools, and various autonomous agents. With the MTP drafter, Google has demonstrated that even on resource-constrained hardware, developers can deploy state-of-the-art language models without having to trade response speed against output quality.
As the cost of and barriers to inference continue to fall, the evolution of Gemma 4 and its supporting technologies is pushing AI from the cloud out to a far broader range of personal computing devices.
