Sensenova and Nanyang Technological University's S-Lab have jointly released and open-sourced NEO, a new multimodal model architecture. By innovating at the underlying architecture level, NEO achieves deep integration of vision and language, delivering comprehensive gains in performance, efficiency, and generality.
Extreme Data Efficiency: 1/10 Data Volume Achieves Top Performance
The most significant breakthrough of NEO lies in its data efficiency: it develops top-tier visual perception from only 390 million image-text examples, roughly one tenth of the data volume used by industry models of similar performance. Without relying on massive data or an additional visual encoder, its compact architecture matches top modular flagship models such as Qwen2-VL and InternVL3 across a range of visual understanding tasks.
On public benchmarks including MMMU, MMB, MMStar, SEED-I, and POPE, NEO scores highly and outperforms other native VLMs overall, demonstrating that a native architecture need not trade away accuracy.

Breaking the "Modular" Design Constraints from the Bottom Up
Currently, most mainstream multimodal models follow the modular "visual encoder + projector + language model" paradigm. Although this approach of extending a large language model makes image input compatible, it remains essentially language-centered: vision and language are fused only at the data level. This modular design not only lowers learning efficiency but also limits the model in complex multimodal scenarios, especially tasks that require capturing fine image detail or understanding complex spatial structure.
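For concreteness, the paradigm can be sketched in a few lines of PyTorch. Every module name, layer count, and dimension below is an illustrative assumption rather than any specific model's implementation; real systems plug in a pretrained vision encoder (such as a ViT) and a pretrained decoder-only LLM.

```python
import torch
import torch.nn as nn

class ModularVLM(nn.Module):
    """Sketch of the 'visual encoder + projector + language model' paradigm."""
    def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=32000):
        super().__init__()
        # Stage 1: a separately pretrained visual encoder (stand-in here).
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=16, batch_first=True),
            num_layers=2,
        )
        # Stage 2: a projector mapping visual features into the LLM's
        # embedding space; the only point where the two modalities meet.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stage 3: a language model that treats projected image features
        # as if they were ordinary text token embeddings.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=32, batch_first=True),
            num_layers=2,
        )

    def forward(self, image_feats, text_ids):
        vis = self.projector(self.visual_encoder(image_feats))
        txt = self.text_embed(text_ids)
        # Fusion is plain concatenation at the input; the language model
        # was never trained jointly with vision from the ground up.
        return self.llm(torch.cat([vis, txt], dim=1))
```

The forward pass makes the criticism concrete: the two modalities touch only at the projector and the concatenated input sequence, which is exactly the shallow, language-centered fusion described above.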
NEO breaks through these limitations by innovating along three key dimensions: the attention mechanism, position encoding, and semantic mapping, enabling the model to process vision and language together natively.
Two Core Technological Innovations
Native Patch Embedding: NEO discards the discrete image tokenizer and builds a continuous mapping from pixels to tokens through a native Patch Embedding Layer (PEL). This design captures image detail more finely, breaking through a fundamental image-modeling bottleneck of mainstream models.
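In spirit, such a layer can be as simple as one strided linear projection from raw pixel patches straight into the model's token space, with no quantization or codebook step in between. Below is a minimal sketch assuming a 16x16 patch size and a hypothetical embedding dimension; NEO's actual PEL configuration is not specified here.

```python
import torch
import torch.nn as nn

class PatchEmbeddingLayer(nn.Module):
    """Continuous pixels-to-tokens mapping: no discrete image tokenizer."""
    def __init__(self, patch=16, in_ch=3, dim=4096):
        super().__init__()
        # A strided conv is a linear projection of each non-overlapping
        # patch of raw pixels directly into a continuous token embedding.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, pixels):                 # (B, 3, H, W)
        x = self.proj(pixels)                  # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

tokens = PatchEmbeddingLayer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 4096])
```

Because the mapping stays continuous end to end, no detail is rounded away into a finite codebook, which is what allows finer capture of image detail.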
Native Multi-Head Attention
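A plausible reading of this second innovation, consistent with the unified design described above, is that image and text tokens share one multi-head attention within a single sequence, rather than being bridged by a separate cross-attention module. The sketch below only illustrates that general idea under assumed dimensions; it is not NEO's published mechanism.

```python
import torch
import torch.nn as nn

dim, heads = 4096, 32  # assumed, for illustration only
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

img_tokens = torch.randn(1, 196, dim)  # e.g. output of a patch-embedding layer
txt_tokens = torch.randn(1, 32, dim)   # ordinary text token embeddings

# One mixed sequence, one attention operation: every token, visual or
# textual, attends to every other token with the same shared weights.
seq = torch.cat([img_tokens, txt_tokens], dim=1)
out, _ = attn(seq, seq, seq)
print(out.shape)  # torch.Size([1, 228, 4096])
```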
