Moonshot AI, the closely watched Chinese AI company, recently announced the open-source release of two new vision-language models: Kimi-VL and Kimi-VL-Thinking. With a lightweight architecture and strong multi-modal understanding and reasoning, the models outperform a number of much larger models, including GPT-4o, on several key benchmarks, drawing significant industry attention.

Lightweight Design, Powerful Performance
Unlike mainstream large models with hundreds of billions or even trillions of parameters, Kimi-VL and Kimi-VL-Thinking use a Mixture-of-Experts (MoE) architecture that activates only about 3 billion parameters per token. This makes them cheaper to run and deploy, requiring far fewer computational resources. Despite the lightweight design, both models post strong results across a range of benchmarks, demonstrating reasoning ability well beyond what their size suggests.
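For readers unfamiliar with Mixture-of-Experts, the toy PyTorch sketch below illustrates the idea behind "activated parameters": a router sends each token to only a couple of experts, so most of the layer's weights sit idle for any given token. This is an illustrative sketch only; the layer sizes, expert count, and class names are invented and do not reflect Moonshot's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts feed-forward layer.

    Each token is routed to only `top_k` of the `num_experts` expert MLPs,
    so the parameters actually *activated* per token are a small fraction
    of the layer's total parameter count.
    """

    def __init__(self, hidden_dim=64, expert_dim=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)      # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, hidden_dim)
        scores = self.router(x)                # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # mixing weights for the chosen experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = ToyMoELayer()
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)                 # torch.Size([10, 64])
    total = sum(p.numel() for p in layer.parameters())
    # With top_k=2 of 8 experts, each token touches only ~25% of the expert weights.
    print(f"total params: {total}, activated per token: ~{2/8:.0%} of expert params")
```

Scaled up, a model built this way can hold far more parameters in total while activating only about 3 billion of them per token, which is what keeps inference cost low.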
Upgraded Multi-modal Intelligence: Strong Performance in Mathematical Reasoning and Agent Operations
The Kimi-VL series excels in multi-modal reasoning and agent capabilities. On MathVision, a benchmark for multi-modal mathematical reasoning, Kimi-VL scored 36.8%, comparable to models roughly ten times its size.
Its performance on ScreenSpot-Pro, which evaluates GUI agent operation capabilities, is even more striking: Kimi-VL scored 34.5%, showing that it can understand complex user interfaces and ground the corresponding actions, and laying a foundation for future intelligent human-computer interaction applications.

Clearer Vision: Native High-Resolution Image Processing
Thanks to its MoonViT vision encoder, the Kimi-VL series processes high-resolution images natively and is strong at recognizing and understanding text in images. It scored 867 on OCRBench (out of a maximum of 1,000), underscoring its ability to handle high-resolution images and complex text, which matters for applications that work with large volumes of images and documents.
Extended Memory: Effortless Handling of Long Contexts
Long-context understanding is another highlight of the Kimi-VL series. Both models support context windows of up to 128K tokens, so they can take in long documents, videos, and other lengthy multi-modal inputs in a single pass for in-depth understanding and analysis.
On the long-document understanding benchmark MMLongBench-Doc, Kimi-VL scored 35.1%, and on the long-video understanding benchmark LongVideoBench it reached 64.5%. This gives the series real potential in applications that must digest large amounts of context, such as document question answering and video analysis.
Open-Source Sharing, Co-creating the Future of Multi-modal Intelligence
Moonshot AI emphasizes that open-sourcing Kimi-VL and Kimi-VL-Thinking is only a small step toward general multi-modal intelligence. Through open-source collaboration, the company hopes to draw more community developers into building applications on the models and jointly exploring what the Kimi-VL series can do in areas such as document question answering, interface operation, image and text understanding, and video analysis.
Developers can find the Kimi-VL models, code, and documentation at:
GitHub: https://github.com/MoonshotAI/Kimi-VL
Hugging Face: https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct
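To get started, the snippet below sketches the typical way such a checkpoint is loaded via the Hugging Face transformers library. It assumes the repository follows the standard AutoModelForCausalLM/AutoProcessor pattern with trust_remote_code enabled, and it uses a hypothetical local image file; the exact message format and processor arguments may differ, so consult the official model card and README before adapting it.

```python
# Minimal loading sketch (assumes the standard transformers Auto* pattern with
# remote code; verify the exact usage against the official model card/README).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",        # pick an appropriate dtype for your hardware
    device_map="auto",
    trust_remote_code=True,    # the model ships custom code in its repository
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Hypothetical example: ask a question about a local screenshot.
image = Image.open("screenshot.png")
messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "What does this screen show?"}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Note: the decoded string includes the prompt; slice off the input tokens if needed.
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```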
