NVIDIA officially launched its latest open-source multimodal large model, Nemotron3Nano Omni, on April 28 local time. Positioned as a "versatile performer," the model combines deep reasoning over video, audio, images, and text, aiming to give developers faster and smarter interaction solutions.

One of the model's major highlights is its innovative technical architecture. Nemotron3Nano Omni uses a 30B-A3B mixture-of-experts (MoE) architecture, with the visual and audio encoders integrated directly into the system. This integrated design breaks with the previous approach, in which multimodal processing relied on multiple independent perception models, shifting from a "fragmented context" to a "unified context."
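To make the MoE idea concrete, here is a minimal sketch of top-k expert routing, the general technique behind "A3B"-style sparse models: only a few experts run per token, so compute cost tracks the active parameters rather than the total. All names, sizes, and the routing scheme below are illustrative assumptions, not NVIDIA's actual configuration.

```python
# Hypothetical top-k mixture-of-experts routing sketch (illustrative only;
# not Nemotron's real implementation or dimensions).
import numpy as np

rng = np.random.default_rng(0)

DIM, N_EXPERTS, TOP_K = 8, 4, 2

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(N_EXPERTS)]
# The router scores every expert for a given token embedding.
router = rng.standard_normal((DIM, N_EXPERTS)) * 0.1


def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route token x through its TOP_K best experts and mix their outputs."""
    logits = x @ router                      # one score per expert
    top = np.argsort(logits)[-TOP_K:]        # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected k only
    # Only TOP_K of N_EXPERTS experts actually run: the sparse-compute saving.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))


token = rng.standard_normal(DIM)
out = moe_forward(token)
print(out.shape)
```

In a real system the router is trained jointly with the experts and runs inside every MoE layer; the point of the sketch is simply that a 30B-parameter model can activate only a small fraction of its weights per token.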


Its performance data is particularly impressive. According to official figures, the model ranks first on six authoritative benchmarks, including complex document processing, video understanding, and audio perception. While maintaining this perceptual accuracy, the system achieves nine times the throughput of comparable open-source omnimodal models. This means companies can deploy AI agents with stronger scalability at lower cost, without sacrificing real-time responsiveness.

Currently, several pioneering tech companies have already integrated the model. Gautier Cloix, CEO of H Company, commented that thanks to the new architecture, their AI agent can now interpret full-HD screen recordings in real time, marking AI's shift from a simple task executor to an interactive entity capable of perceiving and understanding the digital environment in real time.