The ModelScope community has announced that the new multimodal model MiniCPM-V 4.0, known as "Moxie Xiaogangpao," has been officially open-sourced. With 4B parameters, the model achieves state-of-the-art (SOTA) results on multiple benchmarks, including OpenCompass, OCRBench, and MathVista, and runs stably and smoothly on mobile devices such as phones. Alongside the model, the team also open-sourced the inference deployment tool MiniCPM-V CookBook, which helps developers achieve lightweight, out-of-the-box deployment across different needs, scenarios, and devices.
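
For developers who want to try the model from Python, the sketch below shows one way single-image Q&A is typically run with the Hugging Face weights. It assumes the MiniCPM-V-4 checkpoint exposes the same `chat()` interface (via `trust_remote_code`) as earlier MiniCPM-V releases; the authoritative usage examples are in the model card and the CookBook.

```python
# Minimal sketch: single-image Q&A with the openbmb/MiniCPM-V-4 weights.
# Assumes the checkpoint's remote code exposes a chat() helper like earlier
# MiniCPM-V releases; check the model card / CookBook for the exact interface.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "openbmb/MiniCPM-V-4"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,   # 4B parameters fit on a single consumer GPU
).eval().cuda()

image = Image.open("example.jpg").convert("RGB")   # any local test image
msgs = [{"role": "user", "content": [image, "What is in this image?"]}]

# Earlier MiniCPM-V releases accept images inside msgs and take image=None here.
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```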

The open-sourcing of MiniCPM-V 4.0 marks an important step for multimodal models in edge-side applications. At a model size well suited to running on phones, the 4B-parameter MiniCPM-V 4.0 delivers stable operation and fast responses, without overheating or lagging even after long periods of continuous use. An iOS app supporting local deployment of MiniCPM-V 4.0 has also been open-sourced, and developers can download and use it via the CookBook.


In terms of performance, MiniCPM-V 4.0 reaches SOTA multimodal capability at the 4B parameter scale. On evaluation benchmarks such as OpenCompass, OCRBench, MathVista, MMVet, MMBench V1.1, MMStar, AI2D, and HallusionBench, it posts the highest scores among models of its size. On the OpenCompass evaluation in particular, MiniCPM-V 4.0's overall performance surpasses Qwen2.5-VL 3B and InternVL2.5 4B, and even rivals GPT-4.1-mini and Claude 3.5 Sonnet. Compared with the previous-generation MiniCPM-V 2.6 8B model, MiniCPM-V 4.0 significantly improves multimodal capability while halving the parameter count.

MiniCPM-V 4.0 can smoothly handle real-time video understanding, image understanding, and other tasks on edge devices such as phones and PCs, thanks not only to its strong benchmark performance but also to its model structure design, which yields the fastest first response time and lower memory usage among models of the same size. In tests on an Apple M4 with Metal, running MiniCPM-V 4.0 required only 3.33 GB of VRAM, lower than models such as Qwen2.5-VL 3B and Gemma 3 4B. In image understanding tests, MiniCPM-V 4.0 greatly shortens the first response time by using ANE + Metal for auxiliary acceleration, and this advantage becomes more pronounced as the input image resolution increases.
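
First response time here corresponds to the time-to-first-token (TTFT) of the generated answer. The article does not describe its measurement setup, but one simple way to observe TTFT yourself is to stream a response from a locally served, OpenAI-compatible endpoint (several deployment paths in the CookBook expose one) and time the arrival of the first content chunk. The endpoint URL, model name, and image file below are placeholders.

```python
# Rough TTFT probe against a locally served, OpenAI-compatible endpoint.
# The base_url, model name, and image path are placeholders for your own setup.
import base64
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send the test image inline as a base64 data URL.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="MiniCPM-V-4",            # whatever name your server registered
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    stream=True,
)

for chunk in stream:
    # The first non-empty content delta marks the first response token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"time to first token: {time.perf_counter() - start:.2f}s")
        break
```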

In addition, the research team tested the model's concurrency and throughput on two RTX 4090 GPUs. Within the range of available compute, the total-throughput advantage of MiniCPM-V 4.0 grows as the number of concurrent users increases. For example, with 256 concurrent requests, MiniCPM-V 4.0 reaches a throughput of 13,856 tokens/s, far exceeding the 7,153 tokens/s of Qwen2.5-VL and the 7,607 tokens/s of Gemma 3.
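
The article does not publish the benchmarking script, but an aggregate tokens/s figure of this kind can be reproduced in spirit by firing N concurrent requests at a serving endpoint and dividing the total completion tokens by wall-clock time. The sketch below uses the async OpenAI client against a hypothetical local server; the URL, model name, and prompt are placeholders, not the team's actual setup.

```python
# Rough aggregate-throughput probe: N concurrent requests against an
# OpenAI-compatible server; reports total completion tokens per second.
import asyncio
import time
from openai import AsyncOpenAI

CONCURRENCY = 256
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    # Each task issues one chat completion and returns its generated token count.
    resp = await client.chat.completions.create(
        model="MiniCPM-V-4",
        messages=[{"role": "user", "content": "Summarize what a vision-language model is."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"{sum(counts)} tokens in {elapsed:.1f}s "
          f"-> {sum(counts) / elapsed:.0f} tokens/s")

asyncio.run(main())
```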

GitHub: 🔗 https://github.com/OpenBMB/MiniCPM-o

Hugging Face: 🔗 https://huggingface.co/openbmb/MiniCPM-V-4

ModelScope: 🔗 https://modelscope.cn/models/OpenBMB/MiniCPM-V-4

CookBook: 🔗 https://github.com/OpenSQZ/MiniCPM-V-CookBook