At the 2025 Xiaomi Human-Car-Home Ecosystem Partner Conference held today, Luo Fuli, the new head of Xiaomi's MiMo large-model team, made her public debut and officially announced the company's latest Mixture-of-Experts (MoE) large model, MiMo-V2-Flash. The new model is positioned as Xiaomi's second step toward its goal of artificial general intelligence (AGI).

Luo Fuli described the technical architecture of MiMo-V2-Flash in detail on social media. The model adopts a hybrid sliding-window attention (SWA) architecture that is simple and elegant, and it performs significantly better on long-context reasoning than other linear-attention variants. Notably, the team found a window size of 128 to be optimal; larger windows can actually degrade performance. The fixed-size KV cache that SWA entails also improves compatibility with existing inference infrastructure.
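To make the mechanism concrete, here is a minimal sketch of causal sliding-window attention in PyTorch. This is an illustration of SWA in general, not Xiaomi's implementation; the window size of 128 simply follows the figure quoted above.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 128):
    """q, k, v: [batch, heads, seq_len, head_dim].

    Each query attends only to the `window` most recent positions
    (itself included), so the per-layer KV cache stays fixed at
    `window` entries no matter how long the context grows.
    """
    seq_len = q.size(-2)
    pos = torch.arange(seq_len, device=q.device)
    causal = pos[None, :] <= pos[:, None]          # key index <= query index
    in_window = pos[None, :] > pos[:, None] - window  # key within the window
    mask = causal & in_window                      # [seq_len, seq_len] bool
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

The fixed window is what makes the KV cache constant-size and friendly to existing serving stacks, at the cost of each layer seeing only local context; in hybrid designs, interleaved full-attention layers are typically what carry the long-range information.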

Luo Fuli also highlighted a key technique, Multi-Token Prediction (MTP), which delivers markedly more efficient reinforcement learning (RL): even MTP layers beyond the first need only minimal fine-tuning to reach a high acceptance length. Three-layer MTP performs particularly well on programming tasks, reaching an acceptance length above 3 and speeding up decoding by roughly 2.5x, which effectively eliminates GPU idle time in small-batch on-policy RL.
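The speedup comes from using the MTP heads as a built-in draft model for speculative decoding. Below is a minimal sketch of the greedy acceptance rule and how "acceptance length" is counted; the function and variable names are hypothetical placeholders, not Xiaomi's code.

```python
def accept_drafts(drafts: list[int], targets: list[int]) -> list[int]:
    """drafts: the k tokens proposed by the MTP heads in one step.
    targets: k + 1 tokens = the main model's greedy choice at each drafted
    position, plus one bonus position after the last draft (all obtained
    from a single verification forward pass).
    Returns the tokens actually emitted; its length is the acceptance length.
    """
    assert len(targets) == len(drafts) + 1
    out = []
    for d, t in zip(drafts, targets):
        out.append(t)            # t == d whenever the draft is accepted
        if d != t:
            break                # first mismatch: keep the correction, stop
    else:
        out.append(targets[-1])  # every draft matched: take the bonus token
    return out

# With three MTP layers (k = 3), one verification pass can emit up to 4 tokens:
print(accept_drafts([5, 9, 2], [5, 9, 2, 7]))  # [5, 9, 2, 7] -> length 4
print(accept_drafts([5, 9, 2], [5, 8, 2, 7]))  # [5, 8]       -> length 2
```

An average acceptance length above 3 means more than three tokens are emitted per main-model forward pass, which is how the roughly 2.5x speedup arises and why small-batch on-policy RL rollouts no longer leave GPUs idle.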

In the post-training phase, Xiaomi adopted the On-Policy Distillation approach proposed by Thinking Machines Lab to consolidate multiple RL-trained models into a single student. With this method, the student matched the teacher model's performance at roughly 1/50 of the compute cost of conventional SFT and RL pipelines. The process also points to a path of continuous evolution for the student model, ultimately forming a self-reinforcing loop.
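For intuition, here is a minimal sketch of the on-policy distillation objective in the style Thinking Machines Lab describes: the student samples its own trajectories, the teacher scores those same tokens, and the student minimizes a per-token reverse KL against the teacher. Tensor shapes and the masking scheme here are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits, token_mask):
    """student_logits, teacher_logits: [batch, seq_len, vocab], both computed
    on sequences *sampled from the student* (hence "on-policy").
    token_mask: [batch, seq_len], 1 at generated positions, 0 elsewhere.

    Returns the mean reverse KL, KL(student || teacher), over generated
    tokens; gradients flow only through the student.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits.detach(), dim=-1)
    # Reverse KL per position: sum_v p_s(v) * (log p_s(v) - log p_t(v))
    kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1)
    return (kl * token_mask).sum() / token_mask.sum()
```

Because the teacher only needs to score tokens the student already generated (no teacher sampling, no reward rollouts), each update is far cheaper than a full RL step, which is consistent with the reported ~1/50 compute cost.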

Luo Fuli stated that the team transformed these ideas into production systems within just a few months, showcasing extraordinary efficiency and creativity.

Key Points:

🌟 MiMo-V2-Flash, built on an MoE architecture, is Xiaomi's second step toward its AGI goal.

⚡ Multi-token prediction significantly improves decoding speed and RL efficiency.

💡 In the post-training phase, on-policy distillation integrates multiple RL models, demonstrating strong self-reinforcement capability.