Multimodal large language models (MLLMs) are showing growing potential for solving complex problems. However, these models often behave rigidly during complex reasoning and lack the ability to reflect, making it hard for them to backtrack and retry when a problem requires multiple attempts. To address this, a research team from Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory has launched MM-HELIX, a project aimed at teaching AI to perform human-like long-chain reflective reasoning.

MM-HELIX is not just a single model but a full ecosystem. The team first built a benchmark, described as an "ultimate exam," to evaluate the reflective reasoning ability of multimodal large models. The benchmark spans 42 highly complex tasks covering algorithms, graph theory, puzzles, and strategy games. Test results show that even the most advanced models achieve low accuracy, and performance drops further when the input is multimodal. This result underscores how much room there is to improve AI's reflective abilities.


To help multimodal large models learn to reflect, the research team also built a dataset called MM-HELIX-100K, containing 100,000 high-quality samples that teach models how to reflect and self-correct. The data is produced through a Step-Elicited Response Generation (SERG) pipeline, which markedly shortens the time needed to generate solutions while trimming unnecessary, redundant reasoning.
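The article does not spell out how the SERG pipeline works internally, but one plausible reading is a loop that progressively reveals reference solution steps as hints until the model produces a correct, reflective trace. The sketch below follows that reading; the `model`, `solver`, and `problem` interfaces and the prompt wording are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of step-elicited response generation (SERG-style),
# assuming a rule-based solver that can emit reference solution steps.

def step_elicited_generation(model, solver, problem, max_hint_steps=5):
    """Elicit a long reflective reasoning trace by revealing more and more
    reference steps as hints until the model reaches the correct answer."""
    reference_steps = solver(problem)           # ground-truth step sequence (assumed available)
    for k in range(max_hint_steps + 1):
        hints = reference_steps[:k]             # reveal the first k steps as scaffolding
        prompt = (
            f"{problem.text}\n"
            + "".join(f"Hint {i + 1}: {h}\n" for i, h in enumerate(hints))
            + "Think step by step, check intermediate results, and revise them if they look wrong."
        )
        trace = model.generate(prompt)
        if problem.check_answer(trace):         # keep only traces that end in a correct answer
            return {"problem": problem.text, "response": trace, "hints_used": k}
    return None                                 # discard problems the model never solves
```

Filtering out traces that never reach a correct answer is one simple way such a pipeline could keep only high-quality reflective samples while avoiding overly long, meandering reasoning.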


In addition, the team proposed an Adaptive Hybrid Policy Optimization (AHPO) algorithm, which acts like an adaptive tutor: it helps the model gradually shift from relying on expert guidance to exploring on its own during training. This dynamic teaching mechanism lets the model improve accuracy while developing independent problem-solving skills.
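The article only describes AHPO at a high level. A minimal sketch of the general idea, assuming the expert signal is a supervised loss on demonstration traces and the self-exploration signal is a simple advantage-weighted policy gradient, might look like the following; the gating rule, threshold, and weighting are assumptions for illustration, not the published algorithm.

```python
import torch

def ahpo_style_loss(expert_logprobs, sampled_logprobs, rewards, group_accuracy, threshold=0.5):
    """Illustrative hybrid objective: lean on expert demonstrations while the
    model's own success rate is low, then fade to pure on-policy learning."""
    # Supervised term: maximize likelihood of expert reasoning traces.
    sft_loss = -expert_logprobs.mean()

    # On-policy term: advantage-weighted policy gradient over the model's own rollouts.
    advantages = rewards - rewards.mean()
    rl_loss = -(advantages.detach() * sampled_logprobs).mean()

    # Adaptive gate: expert guidance is applied only while the model is still failing.
    expert_weight = 1.0 if group_accuracy < threshold else 0.0
    return expert_weight * sft_loss + rl_loss
```

Here `group_accuracy` would be the fraction of sampled rollouts that solve the current task, so expert supervision naturally switches off as the model becomes able to succeed on its own.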

With these components, a Qwen2.5-VL-7B model trained with MM-HELIX achieved an 18.6% accuracy gain on the benchmark. This improvement not only breaks through the base model's performance ceiling but also shows that reflective ability generalizes well, underscoring the project's significance for AI development.