Multimodal large language models (MLLMs) are showing growing potential for solving complex problems. However, these models often behave rigidly during complex reasoning and lack the ability to reflect, making it hard for them to backtrack and retry when a problem requires multiple attempts. To address this, a research team from Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory has launched MM-HELIX, a project aimed at teaching AI to perform human-like long-chain reflective reasoning.

MM-HELIX is not just a single model but a full ecosystem. The team first built a benchmark, described as an "ultimate exam," to evaluate the reflective reasoning ability of multimodal large models. The benchmark spans 42 highly complex tasks covering algorithms, graph theory, puzzles, and strategy games. Test results show that even the most advanced models achieve low accuracy, and performance drops further when the input is multimodal. This result underscores how much room there is to improve AI's reflective abilities.


To help multimodal large models learn to reflect, the research team also built a dataset called MM-HELIX-100K, containing 100,000 high-quality samples that teach models how to reflect and self-correct. The data is produced through a Step-Elicited Response Generation (SERG) pipeline, which markedly shortens the time needed to generate solutions while trimming unnecessary, redundant reasoning.
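The article does not spell out how the SERG pipeline works internally, but one plausible reading is a loop that progressively reveals reference solution steps as hints until the model produces a correct, reflective trace. The sketch below follows that reading; the `model`, `solver`, and `problem` interfaces and the prompt wording are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of step-elicited response generation (SERG-style),
# assuming a rule-based solver that can emit reference solution steps.

def step_elicited_generation(model, solver, problem, max_hint_steps=5):
    """Elicit a long reflective reasoning trace by revealing more and more
    reference steps as hints until the model reaches the correct answer."""
    reference_steps = solver(problem)           # ground-truth step sequence (assumed available)
    for k in range(max_hint_steps + 1):
        hints = reference_steps[:k]             # reveal the first k steps as scaffolding
        prompt = (
            f"{problem.text}\n"
            + "".join(f"Hint {i + 1}: {h}\n" for i, h in enumerate(hints))
            + "Think step by step, check intermediate results, and revise them if they look wrong."
        )
        trace = model.generate(prompt)
        if problem.check_answer(trace):         # keep only traces that end in a correct answer
            return {"problem": problem.text, "response": trace, "hints_used": k}
    return None                                 # discard problems the model never solves
```

Filtering out traces that never reach a correct answer is one simple way such a pipeline could keep only high-quality reflective samples while avoiding overly long, meandering reasoning.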


In addition, the team proposed an Adaptive Hybrid Policy Optimization (AHPO) algorithm, which acts like an adaptive tutor: it helps the model gradually shift from relying on expert guidance to exploring on its own during training. This dynamic teaching mechanism lets the model improve accuracy while developing independent problem-solving skills.
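The article only describes AHPO at a high level. A minimal sketch of the general idea, assuming the expert signal is a supervised loss on demonstration traces and the self-exploration signal is a simple advantage-weighted policy gradient, might look like the following; the gating rule, threshold, and weighting are assumptions for illustration, not the published algorithm.

```python
import torch

def ahpo_style_loss(expert_logprobs, sampled_logprobs, rewards, group_accuracy, threshold=0.5):
    """Illustrative hybrid objective: lean on expert demonstrations while the
    model's own success rate is low, then fade to pure on-policy learning."""
    # Supervised term: maximize likelihood of expert reasoning traces.
    sft_loss = -expert_logprobs.mean()

    # On-policy term: advantage-weighted policy gradient over the model's own rollouts.
    advantages = rewards - rewards.mean()
    rl_loss = -(advantages.detach() * sampled_logprobs).mean()

    # Adaptive gate: expert guidance is applied only while the model is still failing.
    expert_weight = 1.0 if group_accuracy < threshold else 0.0
    return expert_weight * sft_loss + rl_loss
```

Here `group_accuracy` would be the fraction of sampled rollouts that solve the current task, so expert supervision naturally switches off as the model becomes able to succeed on its own.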

With these components, a Qwen2.5-VL-7B model trained with MM-HELIX achieved an 18.6% accuracy gain on the benchmark. This improvement not only breaks through the base model's performance ceiling but also shows that reflective ability generalizes well, underscoring the project's significance for AI development.