The Qwen Pilot team at Alibaba's Tongyi Lab recently introduced FIPO (Future-KL Influenced Policy Optimization), an algorithm intended to overcome a bottleneck that large models run into when learning to reason. Conventional reinforcement learning with verifiable rewards (RLVR) often fails to distinguish which tokens in a reasoning chain are critical to the final result. Accurately identifying these key tokens has therefore become a pressing problem.

FIPO introduces a Future-KL mechanism that specifically rewards tokens with a significant impact on subsequent reasoning, thereby addressing the "reasoning length stagnation" problem seen in pure RL training. In tests under a 32B pure-RL setup, FIPO outperformed models of similar scale such as o1-mini and DeepSeek-Zero-MATH.
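The article does not give the Future-KL formula, so the following is only a rough sketch of the general idea of crediting a token for what happens after it. It assumes each position already has a signed score for how much training shifted the policy there (for example, a change in log probability), and it sums those scores over the current and later positions with a discount. The function names, the discount `gamma`, and the clipping bounds are all hypothetical choices made for illustration, not the team's published method.

```python
import math


def discounted_future_sum(deltas, gamma=0.9):
    """For each position t, return sum over k >= t of gamma**(k - t) * deltas[k].

    `deltas` is an assumed per-token signed shift signal; `gamma` is a
    hypothetical discount that makes distant tokens count less.
    """
    out = [0.0] * len(deltas)
    running = 0.0
    # Walk backwards so each position reuses the sum already computed after it.
    for t in range(len(deltas) - 1, -1, -1):
        running = deltas[t] + gamma * running
        out[t] = running
    return out


def influence_weights(deltas, gamma=0.9, low=0.5, high=2.0):
    """Turn each future sum into a bounded multiplicative weight (illustrative).

    Exponentiating and clipping is an assumption made here to keep the
    weights positive and within [low, high].
    """
    sums = discounted_future_sum(deltas, gamma)
    return [min(max(math.exp(s), low), high) for s in sums]


print(discounted_future_sum([0.0, 1.0, 0.0], gamma=0.5))  # [0.5, 1.0, 0.0]
```

In this toy case the first token has no shift of its own but still receives credit (0.5) because the token after it shifted, which is the kind of forward-looking credit the paragraph above describes.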

According to the team's findings, the model's predictions for most tokens change little between before and after training, indicating that the effect of reinforcement learning is extremely sparse. The team also found that metrics commonly used in the industry, such as entropy and KL divergence, struggle to pinpoint changes in key tokens. They therefore introduced a new lens: the signed difference in log probability (Δlog p), which captures the direction of optimization.
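As a minimal sketch of why a signed quantity carries direction, the snippet below computes Δlog p per token as the log probability after training minus the log probability before training. The token list, the probabilities, and the 0.5 threshold are made-up illustrative values, and the helper names are hypothetical rather than taken from the team's code.

```python
import math


def signed_logprob_delta(logp_before, logp_after):
    """Per-token signed change in log probability: after minus before."""
    if len(logp_before) != len(logp_after):
        raise ValueError("both sequences must cover the same tokens")
    return [after - before for before, after in zip(logp_before, logp_after)]


def shifted_tokens(tokens, deltas, threshold=0.5):
    """Keep only tokens whose |delta| reaches the (illustrative) threshold."""
    return [(tok, d) for tok, d in zip(tokens, deltas) if abs(d) >= threshold]


# Made-up example: most tokens are unchanged, two move in opposite directions.
tokens = ["So", "x", "=", "4", "Wait", ",", "check"]
p_before = [0.90, 0.80, 0.95, 0.70, 0.05, 0.90, 0.30]
p_after = [0.90, 0.80, 0.95, 0.70, 0.40, 0.90, 0.10]

deltas = signed_logprob_delta(
    [math.log(p) for p in p_before],
    [math.log(p) for p in p_after],
)
shifted = shifted_tokens(tokens, deltas)
print([tok for tok, _ in shifted])  # ['Wait', 'check']
```

Here "Wait" has a positive Δlog p (training made it more likely) and "check" a negative one (less likely). A magnitude-only measure would flag both tokens without telling the two directions apart.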

In the experiments, FIPO was applied to the base model Qwen2.5-32B-Base, where it broke through the reasoning-length bottleneck and reached an average reasoning length of more than 10,000 tokens. The algorithm also delivered a significant improvement in reasoning accuracy, demonstrating its potential for complex mathematical reasoning.

Key points:
🔍 The FIPO algorithm was developed by the Qwen Pilot team at Alibaba Tongyi Lab and aims to enhance the reasoning ability of large models.
📈 This algorithm can accurately identify tokens that have a significant impact on reasoning, breaking through the reasoning length bottleneck.
🧠 Experiments show that FIPO performs significantly better than traditional algorithms in complex mathematical reasoning.
