Sensenova and Nanyang Technological University's S-Lab have jointly released and open-sourced NEO, a new multimodal model architecture. By innovating at the underlying architecture level, NEO achieves deep integration of vision and language, delivering comprehensive gains in performance, efficiency, and generality.
Extreme Data Efficiency: 1/10 Data Volume Achieves Top Performance
The most significant breakthrough of NEO lies in its data efficiency: it develops top-tier visual perception from only 390 million image-text examples, roughly one tenth of the data volume used by industry models of similar performance. Without relying on massive data or an additional visual encoder, its compact architecture matches top modular flagship models such as Qwen2-VL and InternVL3 across a range of visual understanding tasks.
On public benchmarks including MMMU, MMB, MMStar, SEED-I, and POPE, NEO scores highly and outperforms other native VLMs overall, demonstrating that a native architecture need not trade away accuracy.

Breaking the "Modular" Design Constraints from the Bottom Up
Currently, most mainstream multimodal models follow the modular "visual encoder + projector + language model" paradigm. Although this approach of extending a large language model makes image input compatible, it remains essentially language-centered: vision and language are fused only at the data level. This modular design not only lowers learning efficiency but also limits the model in complex multimodal scenarios, especially tasks that require capturing fine image detail or understanding complex spatial structure.
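For concreteness, the paradigm can be sketched in a few lines of PyTorch. Every module name, layer count, and dimension below is an illustrative assumption rather than any specific model's implementation; real systems plug in a pretrained vision encoder (such as a ViT) and a pretrained decoder-only LLM.

```python
import torch
import torch.nn as nn

class ModularVLM(nn.Module):
    """Sketch of the 'visual encoder + projector + language model' paradigm."""
    def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=32000):
        super().__init__()
        # Stage 1: a separately pretrained visual encoder (stand-in here).
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=16, batch_first=True),
            num_layers=2,
        )
        # Stage 2: a projector mapping visual features into the LLM's
        # embedding space; the only point where the two modalities meet.
        self.projector = nn.Linear(vision_dim, llm_dim)
        # Stage 3: a language model that treats projected image features
        # as if they were ordinary text token embeddings.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=32, batch_first=True),
            num_layers=2,
        )

    def forward(self, image_feats, text_ids):
        vis = self.projector(self.visual_encoder(image_feats))
        txt = self.text_embed(text_ids)
        # Fusion is plain concatenation at the input; the language model
        # was never trained jointly with vision from the ground up.
        return self.llm(torch.cat([vis, txt], dim=1))
```

The forward pass makes the criticism concrete: the two modalities touch only at the projector and the concatenated input sequence, which is exactly the shallow, language-centered fusion described above.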
NEO breaks through these limitations by innovating along three key dimensions: the attention mechanism, position encoding, and semantic mapping, enabling the model to process vision and language together natively.
Two Core Technological Innovations
Native Patch Embedding: NEO discards the discrete image tokenizer and builds a continuous mapping from pixels to tokens through a native Patch Embedding Layer (PEL). This design captures image detail more finely, breaking through a fundamental image-modeling bottleneck of mainstream models.
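In spirit, such a layer can be as simple as one strided linear projection from raw pixel patches straight into the model's token space, with no quantization or codebook step in between. Below is a minimal sketch assuming a 16x16 patch size and a hypothetical embedding dimension; NEO's actual PEL configuration is not specified here.

```python
import torch
import torch.nn as nn

class PatchEmbeddingLayer(nn.Module):
    """Continuous pixels-to-tokens mapping: no discrete image tokenizer."""
    def __init__(self, patch=16, in_ch=3, dim=4096):
        super().__init__()
        # A strided conv is a linear projection of each non-overlapping
        # patch of raw pixels directly into a continuous token embedding.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, pixels):                 # (B, 3, H, W)
        x = self.proj(pixels)                  # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim)

tokens = PatchEmbeddingLayer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 4096])
```

Because the mapping stays continuous end to end, no detail is rounded away into a finite codebook, which is what allows finer capture of image detail.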
Native Multi-Head Attention
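A plausible reading of this second innovation, consistent with the unified design described above, is that image and text tokens share one multi-head attention within a single sequence, rather than being bridged by a separate cross-attention module. The sketch below only illustrates that general idea under assumed dimensions; it is not NEO's published mechanism.

```python
import torch
import torch.nn as nn

dim, heads = 4096, 32  # assumed, for illustration only
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

img_tokens = torch.randn(1, 196, dim)  # e.g. output of a patch-embedding layer
txt_tokens = torch.randn(1, 32, dim)   # ordinary text token embeddings

# One mixed sequence, one attention operation: every token, visual or
# textual, attends to every other token with the same shared weights.
seq = torch.cat([img_tokens, txt_tokens], dim=1)
out, _ = attn(seq, seq, seq)
print(out.shape)  # torch.Size([1, 228, 4096])
```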
