NVIDIA officially launched its latest open-source multimodal large model, Nemotron3Nano Omni, on April 28 local time. Positioned as a "versatile performer," the model combines deep reasoning over video, audio, images, and text, aiming to give developers faster and smarter interaction solutions.

One of the model's major highlights is its innovative technical architecture. Nemotron3Nano Omni uses a 30B-A3B mixture-of-experts (MoE) architecture, with the visual and audio encoders integrated directly into the system. This integrated design breaks with the previous approach, in which multimodal processing relied on multiple independent perception models, shifting from a "fragmented context" to a "unified context."
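To make the MoE idea concrete, here is a minimal sketch of top-k expert routing, the general technique behind "A3B"-style sparse models: only a few experts run per token, so compute cost tracks the active parameters rather than the total. All names, sizes, and the routing scheme below are illustrative assumptions, not NVIDIA's actual configuration.

```python
# Hypothetical top-k mixture-of-experts routing sketch (illustrative only;
# not Nemotron's real implementation or dimensions).
import numpy as np

rng = np.random.default_rng(0)

DIM, N_EXPERTS, TOP_K = 8, 4, 2

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(N_EXPERTS)]
# The router scores every expert for a given token embedding.
router = rng.standard_normal((DIM, N_EXPERTS)) * 0.1


def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route token x through its TOP_K best experts and mix their outputs."""
    logits = x @ router                      # one score per expert
    top = np.argsort(logits)[-TOP_K:]        # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected k only
    # Only TOP_K of N_EXPERTS experts actually run: the sparse-compute saving.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))


token = rng.standard_normal(DIM)
out = moe_forward(token)
print(out.shape)
```

In a real system the router is trained jointly with the experts and runs inside every MoE layer; the point of the sketch is simply that a 30B-parameter model can activate only a small fraction of its weights per token.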


Its performance data is particularly impressive. According to official figures, the model ranks first on six authoritative benchmarks, including complex document processing, video understanding, and audio perception. While maintaining this perceptual accuracy, the system achieves nine times the throughput of comparable open-source omnimodal models. This means companies can deploy AI agents with stronger scalability at lower cost, without sacrificing real-time responsiveness.

Currently, several pioneering tech companies have already integrated the model. Gautier Cloix, CEO of H Company, commented that thanks to the new architecture, their AI agent can now interpret full-HD screen recordings in real time, marking AI's shift from a simple task executor to an interactive entity capable of perceiving and understanding the digital environment in real time.