As the AI model arms race pushes computing costs out of reach for most, Mira Murati, former CTO of OpenAI, is leading Thinking Machines Lab with a technique called "on-policy distillation" that is resetting expectations across the industry. Recent research shows that a small model with only 8 billion parameters, trained with this method, can reach 70% of the performance of a 32B model while training cost drops by 90% and efficiency rises by 50 to 100 times. This means that small and medium-sized enterprises, and even individual developers, can train specialized AI at extremely low cost that rivals what big companies produce.
50-100x Efficiency Jump: 150 Steps Outperform 18,000 GPU Hours
Traditional reinforcement learning (RL) training often requires thousands of steps and massive computing power. On the AIME'24 math reasoning benchmark, for example, a pure RL method consumed 17,920 GPU hours and reached only 68% accuracy. A Qwen3-8B model trained with on-policy distillation, by contrast, hit 70% accuracy in just 150 training steps, at almost negligible computational cost.

The core is the "dense feedback per token" mechanism: unlike RL, which hands out only a sparse reward at the end of each episode, on-policy distillation has the teacher model score every token the student generates in real time, providing a continuous and precise guidance signal. This not only accelerates convergence but also effectively prevents "policy drift" during long-sequence training, so the small model keeps producing high-quality output even under limited resources.
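To make the per-token signal concrete, the objective can be written as a reverse KL divergence between the student's and teacher's next-token distributions, summed over the positions of a trajectory sampled from the student. The notation below is a standard formulation of reverse-KL distillation, not a formula quoted from the paper:

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x \sim \pi_\theta}\left[\sum_{t} \mathrm{KL}\!\left(\pi_\theta(\cdot \mid x_{<t}) \,\big\|\, \pi_{\text{teacher}}(\cdot \mid x_{<t})\right)\right]
$$

Each term in the sum is a per-token penalty: wherever the student's distribution drifts from the teacher's on the student's own samples, it receives an immediate corrective signal instead of waiting for an end-of-episode reward.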
Solving "Catastrophic Forgetting": Learning New Knowledge Without Losing Old Skills
AI models often "forget" their original abilities when new knowledge is injected. In one experiment, after fine-tuning on internal documents, a model's instruction-following score dropped from 85% to 45%. On-policy distillation, by sampling trajectories in real time and letting the teacher correct them step by step, retains 41% of the new knowledge while quickly restoring instruction following to 83%, far exceeding traditional fine-tuning or offline distillation.
This feature makes it particularly suitable for enterprise scenarios: the model can dynamically learn business rules and product documentation without losing core capabilities such as basic conversation and tool calling, truly achieving "continuous evolution."
Four-Step Loop: Simple Architecture, Accessible Implementation
This method is extremely lightweight, requiring only four steps run in a loop (a code sketch follows the list):
1. Deploy a teacher model (such as a 32B model) as the supervision source;
2. Have the student model generate response trajectories;
3. Have the teacher compute the log probability of each token in those trajectories;
4. Update the student's parameters using reverse Kullback-Leibler divergence as the loss.
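The loop maps almost directly onto a short training script. The sketch below uses PyTorch and Hugging Face Transformers; the model names, sampling settings, and learning rate are illustrative assumptions rather than the configuration used in the paper, and the reverse-KL loss is computed exactly over the vocabulary at every response position so that each token carries its own gradient signal.

```python
# Minimal on-policy distillation loop (PyTorch + Hugging Face Transformers).
# Checkpoints, sampling settings, and the learning rate below are illustrative
# placeholders, not the configuration reported by Thinking Machines Lab.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT_NAME = "Qwen/Qwen3-8B"    # student being trained (assumed checkpoint)
TEACHER_NAME = "Qwen/Qwen3-32B"   # frozen teacher; must share the student's tokenizer

# Step 1: deploy the teacher model as the supervision source.
tokenizer = AutoTokenizer.from_pretrained(STUDENT_NAME)
student = AutoModelForCausalLM.from_pretrained(STUDENT_NAME, torch_dtype=torch.bfloat16)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_NAME, torch_dtype=torch.bfloat16)
teacher.eval()

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)


def reverse_kl_per_token(student_logits, teacher_logits):
    """Exact reverse KL(student || teacher) over the vocabulary at each position."""
    s_logp = F.log_softmax(student_logits.float(), dim=-1)
    t_logp = F.log_softmax(teacher_logits.float(), dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)  # shape: [batch, positions]


def distill_step(prompt: str) -> float:
    # Step 2: the student samples an on-policy response trajectory for the prompt.
    enc = tokenizer(prompt, return_tensors="pt")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        traj = student.generate(**enc, max_new_tokens=256, do_sample=True)

    # Step 3: the frozen teacher scores every token of that same trajectory.
    with torch.no_grad():
        teacher_logits = teacher(traj).logits

    # The student re-scores its own trajectory with gradients enabled.
    student_logits = student(traj).logits

    # Step 4: dense per-token reverse KL over the response positions, then update.
    # Logits at position t predict token t+1, so the response span starts at prompt_len - 1.
    kl = reverse_kl_per_token(
        student_logits[:, prompt_len - 1 : -1, :],
        teacher_logits[:, prompt_len - 1 : -1, :],
    )
    loss = kl.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher only scores tokens the student has already produced, its forward pass can be batched or even served from a separate inference endpoint, which helps explain why the loop fits existing distillation infrastructure.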
No complex infrastructure is needed, and the approach is compatible with existing distillation frameworks, delivering a performance jump that is both cheap and accurate. The paper states that the technique can be seamlessly extended to tasks such as code generation and multimodal reasoning, opening a new path for "teacher-student" collaborative training.
Mira Murati's "Downgrade Strike": The Key to AI Democratization
As the former CTO of OpenAI, Murati is applying her hands-on experience with large-model training to building an efficient small-model ecosystem. At a time when AI safety and alignment matter more than ever, on-policy distillation not only improves efficiency but also makes model behavior more predictable through controlled knowledge transfer.
Industry experts predict that this technology will greatly accelerate the development of open-source models and edge AI. When an 8B model can handle 32B-class tasks, phones, IoT devices, and even local servers become viable hosts for high-performance AI, and intelligence moves from a "cloud monopoly" toward something accessible to everyone.
This training revolution sparked by Murati may be the turning point for AI to shift from a "giant's game" to "a common tool." When small models can be as smart as large models, the true era of intelligent democratization has just begun.
