StepFun recently open-sourced its latest multimodal vision-language model, Step3-VL-10B. With only 10B parameters, the model delivers competitive performance across multiple benchmarks, tackling the industry challenge of reaching high intelligence levels with a small parameter count.


In core performance tests, Step3-VL-10B not only reaches SOTA levels in visual perception, logical reasoning, and math competitions, but also matches or even surpasses open-source models 10 to 20 times its size (such as the 235B Qwen3-VL-Thinking) and top-tier closed-source flagship models. Built on full-parameter, end-to-end multimodal joint pre-training and iterative large-scale reinforcement learning, the model has entered the first tier in high-difficulty math competitions such as AIME.

This release includes both Base and Thinking versions. Thanks to the innovative parallel coordination reasoning mechanism (PaCoRe), the model is especially stable on tasks such as high-precision OCR, complex counting, and spatial-topology understanding. Complex multimodal reasoning that previously required cloud computing can now be deployed cost-effectively on edge devices such as phones and computers, greatly improving the interaction efficiency of on-device agents (see the quick-start sketch after the links below).

  • Project Homepage: https://stepfun-ai.github.io/Step3-VL-10B/

  • Paper Link: https://arxiv.org/abs/2601.09668

  • HuggingFace: https://huggingface.co/collections/stepfun-ai/step3-vl-10b

  • ModelScope: https://modelscope.cn/collections/stepfun-ai/Step3-VL-10B
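As a quick illustration of local, on-device use, here is a minimal inference sketch. It assumes the weights load through the standard Hugging Face transformers multimodal pipeline (AutoProcessor / AutoModelForCausalLM with trust_remote_code) and that the repository id is stepfun-ai/Step3-VL-10B, inferred from the collection link above; the model card should be treated as the authoritative loading recipe.

```python
# Minimal local-inference sketch for Step3-VL-10B.
# Assumptions (not confirmed by the announcement): the repo id
# "stepfun-ai/Step3-VL-10B" and the standard transformers multimodal
# chat-template workflow. Check the model card for the exact recipe.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "stepfun-ai/Step3-VL-10B"  # hypothetical repo id, from the collection URL

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps a 10B model within edge-class memory
    device_map="auto",
    trust_remote_code=True,
)

# An OCR + counting prompt, matching the tasks highlighted above.
image = Image.open("receipt.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Read all text in this image and count the line items."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

On a laptop- or phone-class accelerator, the bfloat16 weights of a 10B model fit in roughly 20 GB; further quantization (e.g., 4-bit) would be the usual route for tighter memory budgets.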

Key Points:

  • 🚀 Small Parameters Outperforming Larger Models: With only 10B parameters, Step3-VL-10B challenges and surpasses models at the 200B scale, delivering an exceptional performance-to-scale ratio.

  • 🧠 Deep Logic and Perception: Combining the PaCoRe mechanism with large-scale reinforcement learning, it reaches world-class levels in competition-level mathematics, complex GUI perception, and 3D spatial reasoning.

  • 📱 Edge Intelligence Deployment: It brings high-performance multimodal capabilities to low-compute devices, providing a strong foundation for "active understanding and interaction" on smartphones and industrial embedded devices.