Moonshot AI, the closely watched Chinese AI company, recently announced the open-source release of two new vision-language models: Kimi-VL and Kimi-VL-Thinking. With a lightweight architecture and strong multi-modal understanding and reasoning, the models outperform a number of much larger models, including GPT-4o, on several key benchmarks, drawing significant industry attention.

Lightweight Design, Powerful Performance
Unlike mainstream large models with hundreds of billions or even trillions of parameters, Kimi-VL and Kimi-VL-Thinking use a Mixture-of-Experts (MoE) architecture that activates only about 3 billion parameters per token. This makes them cheaper to run and deploy, requiring far fewer computational resources. Despite the lightweight design, both models post strong results across a range of benchmarks, demonstrating reasoning ability well beyond what their size suggests.
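For readers unfamiliar with Mixture-of-Experts, the toy PyTorch sketch below illustrates the idea behind "activated parameters": a router sends each token to only a couple of experts, so most of the layer's weights sit idle for any given token. This is an illustrative sketch only; the layer sizes, expert count, and class names are invented and do not reflect Moonshot's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts feed-forward layer.

    Each token is routed to only `top_k` of the `num_experts` expert MLPs,
    so the parameters actually *activated* per token are a small fraction
    of the layer's total parameter count.
    """

    def __init__(self, hidden_dim=64, expert_dim=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)      # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, expert_dim),
                nn.GELU(),
                nn.Linear(expert_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, hidden_dim)
        scores = self.router(x)                # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # mixing weights for the chosen experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = ToyMoELayer()
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)                 # torch.Size([10, 64])
    total = sum(p.numel() for p in layer.parameters())
    # With top_k=2 of 8 experts, each token touches only ~25% of the expert weights.
    print(f"total params: {total}, activated per token: ~{2/8:.0%} of expert params")
```

Scaled up, a model built this way can hold far more parameters in total while activating only about 3 billion of them per token, which is what keeps inference cost low.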
Upgraded Multi-modal Intelligence: Strong Performance in Mathematical Reasoning and Agent Operations
The Kimi-VL series excels in multi-modal reasoning and agent capabilities. On MathVision, a benchmark for multi-modal mathematical reasoning, Kimi-VL scored 36.8%, comparable to models roughly ten times its size.
Its performance on ScreenSpot-Pro, which evaluates GUI agent operation capabilities, is even more striking: Kimi-VL scored 34.5%, showing that it can understand complex user interfaces and ground the corresponding actions, and laying a foundation for future intelligent human-computer interaction applications.

Clearer Vision: Native High-Resolution Image Processing
Thanks to its MoonViT vision encoder, the Kimi-VL series processes high-resolution images natively and is strong at recognizing and understanding text in images. It scored 867 on OCRBench (out of a maximum of 1,000), underscoring its ability to handle high-resolution images and complex text, which matters for applications that work with large volumes of images and documents.
Extended Memory: Effortless Handling of Long Contexts
Long-context understanding is another highlight of the Kimi-VL series. Both models support context windows of up to 128K tokens, so they can take in long documents, videos, and other lengthy multi-modal inputs in a single pass for in-depth understanding and analysis.
On the long-document understanding benchmark MMLongBench-Doc, Kimi-VL scored 35.1%, and on the long-video understanding benchmark LongVideoBench it reached 64.5%. This gives the series real potential in applications that must digest large amounts of context, such as document question answering and video analysis.
Open-Source Sharing, Co-creating the Future of Multi-modal Intelligence
Moonshot AI emphasizes that open-sourcing Kimi-VL and Kimi-VL-Thinking is only a small step toward general multi-modal intelligence. Through open-source collaboration, the company hopes to draw more community developers into building applications on the models and jointly exploring what the Kimi-VL series can do in areas such as document question answering, interface operation, image and text understanding, and video analysis.
Developers can find the Kimi-VL models, code, and documentation at:
GitHub: https://github.com/MoonshotAI/Kimi-VL
Hugging Face: https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct
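To get started, the snippet below sketches the typical way such a checkpoint is loaded via the Hugging Face transformers library. It assumes the repository follows the standard AutoModelForCausalLM/AutoProcessor pattern with trust_remote_code enabled, and it uses a hypothetical local image file; the exact message format and processor arguments may differ, so consult the official model card and README before adapting it.

```python
# Minimal loading sketch (assumes the standard transformers Auto* pattern with
# remote code; verify the exact usage against the official model card/README).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",        # pick an appropriate dtype for your hardware
    device_map="auto",
    trust_remote_code=True,    # the model ships custom code in its repository
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Hypothetical example: ask a question about a local screenshot.
image = Image.open("screenshot.png")
messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "What does this screen show?"}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Note: the decoded string includes the prompt; slice off the input tokens if needed.
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```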
