The AI team of Alibaba International Digital Commerce Group (AIDC), AIDC-AI, recently released Ovis2.5, a new multimodal large language model available in 9B and 2B parameter versions. Positioned as an economical visual-reasoning solution, the model delivers outstanding performance for its size, setting a new bar for compact multimodal models.


Key Features of Ovis2.5

1. **Natively Resolution-Aware**: Ovis2.5 adopts the native-resolution NaViT visual encoder, which processes images at their original resolution rather than tiling them, preserving fine detail and global structure for high-quality visual understanding.

2. **Deep Reasoning Capabilities**: The model supports an optional "thinking mode," reportedly drawing on techniques from Alibaba's Qwen3. Beyond linear chain-of-thought (CoT) reasoning, Ovis2.5 can perform self-checking and revision, and it offers a configurable thinking budget to trade inference cost for problem-solving accuracy (see the sketch after this list).

3. **Leading Chart and Document OCR**: At both the 9B and 2B scales, Ovis2.5 achieves industry-leading results in complex chart analysis, document understanding (including tables and forms), and optical character recognition (OCR), making it well suited to real-world applications.

4. **Broad Task Coverage**: The model also performs well on image-reasoning, video-understanding, and visual-grounding benchmarks, demonstrating strong general multimodal capability.
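
As a concrete illustration of points 2 and 3, the sketch below loads the 9B checkpoint from Hugging Face and asks a chart question with thinking mode enabled; the same call shape would serve document and OCR queries. The repository ids follow the release, but `preprocess_inputs`, `get_text_tokenizer`, `enable_thinking`, and `thinking_budget` are assumed names modeled on earlier Ovis releases and Qwen3, so consult the official model card for the exact interface.

```python
# Minimal sketch: query Ovis2.5 with thinking mode enabled (assumed API).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

MODEL_ID = "AIDC-AI/Ovis2.5-9B"  # 2B variant: "AIDC-AI/Ovis2.5-2B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Ovis ships custom multimodal modeling code
).cuda().eval()

image = Image.open("quarterly_revenue.png")  # hypothetical chart image
query = "<image>\nSummarize the revenue trend shown in this chart."

# Assumed helper carried over from earlier Ovis versions: builds the
# multimodal prompt and returns token ids plus image tensors.
prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image])

with torch.inference_mode():
    output_ids = model.generate(
        input_ids.unsqueeze(0).cuda(),
        pixel_values=pixel_values.cuda().to(torch.bfloat16),
        max_new_tokens=1024,
        enable_thinking=True,   # assumed flag: turn on reflective reasoning
        thinking_budget=2048,   # assumed flag: cap tokens spent on thinking
    )

print(model.get_text_tokenizer().decode(output_ids[0],
                                        skip_special_tokens=True))
```

Lowering the thinking budget, or disabling thinking mode entirely, would trade reasoning depth for latency, which fits the model's positioning as an economical option.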

The release of Ovis2.5 highlights AIDC-AI's continued innovation in multimodal AI. By achieving high performance in a compact model size, Ovis2.5 gives developers and enterprises an efficient, easy-to-deploy solution, particularly for scenarios that require combined visual and textual reasoning. The model is open-sourced on platforms such as GitHub and Hugging Face, further promoting collaboration and innovation across the global AI community.

This release marks another significant step forward in AIDC-AI's Ovis series, injecting new vitality into the development of multimodal large language models.