According to reports, the Qwen Pilot team from
Core Breakthrough: Solving "Reasoning Length Stagnation"
Traditional models often struggle to tell which tokens are key to reaching the correct answer on complex problems such as mathematics. FIPO addresses this with targeted changes:
Future-KL Mechanism: Introduces the Future-KL strategy, which specifically rewards tokens that have a significant positive impact on subsequent reasoning, enabling the model to "think ahead."
Signed Log-Probability Difference: Introduces this new mechanism to capture the direction in which the model is being optimized, preventing the reasoning process from getting stuck in unproductive loops.
Reasoning Length Leap: On a base model, FIPO raised the average reasoning length to over 10,000 tokens, addressing the problem of insufficient reasoning depth.
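To make the "think ahead" idea concrete, here is a minimal Python sketch of how a future-looking token weight could be computed. It is an illustration under stated assumptions, not the team's implementation: the function names `future_signed_logprob_sum` and `reweight_advantages`, the discount factor `gamma`, the exponential weighting, and the clipping bounds are all hypothetical choices made here for clarity. The only idea taken from the description above is that a token is credited according to how the log-probabilities of the tokens after it shift between the old and the updated policy.

```python
import math


def future_signed_logprob_sum(new_logps, old_logps, gamma=0.9):
    """For each token t, return the discounted sum of signed log-probability
    differences (updated policy minus old policy) over the tokens AFTER t.

    Hypothetical helper for illustration; gamma is an assumed discount factor.
    """
    deltas = [n - o for n, o in zip(new_logps, old_logps)]
    future = [0.0] * len(deltas)
    running = 0.0
    # Walk backwards so each position only accumulates what comes after it.
    for t in range(len(deltas) - 1, -1, -1):
        future[t] = running
        running = deltas[t] + gamma * running
    return future


def reweight_advantages(advantage, future, low=0.5, high=2.0):
    """Turn one sequence-level advantage into per-token advantages.

    Assumed weighting: exp(future signal), clipped to [low, high] so that a
    single token cannot dominate the update.
    """
    return [advantage * min(max(math.exp(f), low), high) for f in future]


# Toy three-token trajectory: the update raises the probability of the
# second and third tokens and leaves the first unchanged.
new_logps = [-1.0, -0.5, -0.2]
old_logps = [-1.0, -1.0, -1.0]
future = future_signed_logprob_sum(new_logps, old_logps, gamma=0.5)
token_advantages = reweight_advantages(1.0, future)
print(future)
print(token_advantages)
```

In this toy run the first token's own probability does not change, yet it receives a weight above 1 because the tokens that follow it become more likely. The final token has no future, so its weight stays at 1.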
Outstanding Performance: 32B Model Surpasses o1-mini
In practical tests, the 32B-scale model trained with the FIPO algorithm delivered strong results:
Surpassing Competitors: In a pure reinforcement learning setup, its reasoning performance not only surpassed models of the same scale but also outperformed
Mathematical Potential: The algorithm performs especially well on high-level mathematical reasoning tasks, showing strong logical deduction capabilities.
Industry Context: Tongyi Lab's "Intelligent Evolution"
Conclusion: The "Second Curve" of Reasoning Efficiency
While the industry is still debating parameter scale,
