With the rapid development of multimodal large language models (MLLMs), enabling models to evolve from "passively understanding images" to "actively seeking evidence and reasoning" has become a central front of competition in the AI field. However, for lack of high-quality training data, automated trajectory-synthesis pipelines, and detailed training recipes, top-tier multimodal search agents have been difficult to reproduce in the open-source community.
To break this deadlock, a research team from Tencent Hunyuan, in collaboration with the University of California, Los Angeles (UCLA) and The Chinese University of Hong Kong, has released OpenSearch-VL: a fully open-source roadmap for building a cutting-edge deep search agent with reinforcement learning (RL).

An Innovative Data Pipeline That Overcomes the "Search Shortcut"
The research team identifies high-quality training data as the biggest bottleneck holding back model evolution. To train a model capable of multi-step reasoning rather than "one-click image recognition," the team built a carefully designed data synthesis pipeline.
The pipeline samples paths over Wikipedia's hyperlink graph, turning chains of entity relations into multi-hop question-answer pairs. To keep the model from "cheating," the researchers apply fuzzy entity rewriting to hide direct answers and add visual grounding based on anchors in the page source. This design forces the model to first identify visual cues and then retrieve step by step with external tools, so the retrieval chain cannot collapse into a one-step shortcut. On this basis, the team built SearchVL-SFT, a dataset of 36,000 instruction fine-tuning trajectories, and SearchVL-RL, a dataset of 8,000 trajectories for reinforcement learning.
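As a rough illustration of the path-sampling idea (not the team's actual pipeline), the sketch below random-walks a toy hyperlink graph and folds the sampled entity chain into a multi-hop question whose head entity is referred to only as "the entity shown in the image." The graph, function names, and question template are all assumptions made for illustration.

```python
import random

# Toy hyperlink graph standing in for Wikipedia's link structure (illustrative only).
HYPERLINK_GRAPH = {
    "Eiffel Tower": ["Gustave Eiffel", "Paris"],
    "Gustave Eiffel": ["Statue of Liberty"],
    "Paris": ["Seine"],
}

def sample_path(graph, start, hops, seed=None):
    """Random-walk a fixed number of hops to collect a chain of linked entities."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(hops):
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        path.append(rng.choice(neighbors))
    return path

def to_multihop_question(path):
    """Fold the chain into one question whose answer is the tail entity.
    The head entity is never named in the text; it is only 'shown in the image',
    a stand-in for the fuzzy entity rewriting and visual anchoring described above."""
    referent = "the entity shown in the image"
    for entity in path[1:-1]:
        referent = f"'{entity}', which is linked from {referent}"
    return f"Which entity is linked from {referent}?", path[-1]

path = sample_path(HYPERLINK_GRAPH, "Eiffel Tower", hops=2, seed=0)
question, answer = to_multihop_question(path)
print(question)   # a multi-hop question that cannot be answered without the image
print(answer)     # gold answer: the tail entity of the sampled path
```

In this toy version the answer is simply the tail of the sampled path, while the head entity exists only as a visual cue, which is the property that forces tool-assisted, step-by-step retrieval.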
Powerful Toolbox: More Than Just Search
OpenSearch-VL is not limited to simple text search. In real scenarios, the images users provide are often blurry, distorted, or low-resolution, which can render search tools ineffective.
To address this, the project integrates a diverse tool environment: web search, reverse image search, OCR (Optical Character Recognition), image cropping, sharpening, super-resolution reconstruction, and perspective correction. The agent can therefore actively perceive and repair imperfect visual inputs, much as a human would, before querying external knowledge, which keeps subsequent searches accurate.
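A minimal sketch of what such an image-repair tool layer could look like, using Pillow as a stand-in; the tool names, signatures, and the naive resize used in place of true super-resolution are assumptions, not the project's actual API.

```python
from PIL import Image, ImageFilter

def crop(img: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop to the region the agent wants to inspect (left, top, right, bottom)."""
    return img.crop(box)

def sharpen(img: Image.Image) -> Image.Image:
    """Simple sharpening filter as a placeholder for the sharpening tool."""
    return img.filter(ImageFilter.SHARPEN)

def upscale(img: Image.Image, factor: int = 2) -> Image.Image:
    """Plain Lanczos resize as a placeholder for super-resolution reconstruction."""
    return img.resize((img.width * factor, img.height * factor),
                      Image.Resampling.LANCZOS)

# Registry the agent can dispatch into; a real system would add web search,
# reverse image search, OCR, and perspective correction alongside these.
IMAGE_TOOLS = {"crop": crop, "sharpen": sharpen, "upscale": upscale}

def run_tool(name: str, img: Image.Image, **kwargs) -> Image.Image:
    """Execute a tool call emitted by the agent; unknown names fail loudly so
    the trainer can record the step as a tool error."""
    if name not in IMAGE_TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return IMAGE_TOOLS[name](img, **kwargs)
```

An agent loop would then, for example, call `run_tool("upscale", img)` on a low-resolution photo before handing it to reverse image search.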
"Fault-Aware" Algorithm: Letting the Model Learn from Failure
In long-horizon tasks, tool calls often trigger chain reactions: a timeout or error at one step can cause the entire task to fail. Traditional reinforcement learning typically discards these failed trajectories, wasting training signal.
OpenSearch-VL proposes a training algorithm called "Multi-round Fault-Aware GRPO." The algorithm detects the exact point at which a tool call fails, masks out the invalid content that follows the failure, and preserves the useful reasoning before it through one-sided advantage clamping. As a result, even when a trajectory ultimately fails, the model can still learn effective search paths and exploration strategies from its earlier steps.
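The paper's exact formulation is not reproduced here, but the sketch below shows one plausible reading of the two mechanisms on top of a GRPO-style group-normalized advantage: tokens after the failing tool call are masked out of the loss, and the advantage of a failed trajectory is clamped from below at zero on the surviving tokens so the pre-failure reasoning is not punished for a downstream fault. The tensor shapes, function names, and clamping direction are assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize each trajectory's reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def fault_aware_mask(loss_mask: torch.Tensor, fail_token: int | None) -> torch.Tensor:
    """Zero the loss mask from the first token of the failed tool call onward,
    so malformed observations after the fault contribute no gradient."""
    if fail_token is not None:
        loss_mask = loss_mask.clone()
        loss_mask[fail_token:] = 0.0
    return loss_mask

def one_sided_clamp(adv: torch.Tensor, failed: torch.Tensor) -> torch.Tensor:
    """Assumed reading of 'one-sided advantage clamping': for trajectories that
    ultimately failed, clamp the advantage at zero from below so the pre-failure
    tokens that survive the mask are not penalized."""
    return torch.where(failed, adv.clamp(min=0.0), adv)

# Toy group of 4 trajectories: rewards, failure flags, and one failure position.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
failed = torch.tensor([False, True, True, False])
adv = one_sided_clamp(grpo_advantages(rewards), failed)
mask = fault_aware_mask(torch.ones(16), fail_token=9)  # a 16-token trajectory
print(adv)   # failed trajectories end up with zero, not negative, advantage
print(mask)  # tokens from the failure point onward carry no loss
```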
Experimental Performance Comparable to Commercial Proprietary Models
Test results show that OpenSearch-VL performs strongly on seven mainstream multimodal deep search benchmarks, with an average improvement of over 10 percentage points. On some tasks, its performance already rivals today's top closed-source commercial models.
The research team plans to fully open-source all of OpenSearch-VL's training data, code, and model weights, giving developers worldwide a reproducible, improvable foundation and pushing multimodal agent research into deeper waters.
Paper link: https://arxiv.org/pdf/2605.05185
