Tongyi Lab officially launched its new multimodal large model, Qwen3.5-Omni, last night. Compared with its predecessor, the model delivers a leap in comprehension, interaction, and task-execution capability, marking AI's transition from "a screen-based assistant" to "an intelligent agent that understands the physical world."

Core Breakthroughs: Full Modalities and 215 SOTA

Qwen3.5-Omni adopts a native "Full Modality" architecture that seamlessly processes text, image, audio, and video inputs. Across tests covering audio-visual analysis, reasoning, dialogue, and translation, the model achieved 215 SOTA (state-of-the-art) results. In general audio understanding and recognition in particular, it surpasses Gemini-3.1 Pro, while its visual and text capabilities remain top-tier, matching the same-size Qwen3.5 model.


Technical Deep Dive: Hybrid-Attention MoE Architecture

The model retains the classic Thinker-Talker division of labor while fundamentally restructuring both components:

  • Thinker (Understanding Center): Upgraded to a Hybrid-Attention MoE supporting a 256K ultra-long context. This lets it process up to 10 hours of audio or 1 hour of video, accurately capturing fine-grained information in long sequences via TMRoPE (time-aligned multimodal rotary position embedding).

  • Talker (Expression Center): Introduces the new ARIA technique and RVQ (residual vector quantization) coding in place of heavy DiT (diffusion transformer) computation. This not only fixes common speech-output issues such as dropped words and misread numbers, but also gives the model strong real-time voice-control capabilities.
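
The time-alignment idea behind TMRoPE can be illustrated with a small sketch: audio chunks and video frames are both mapped from their real timestamps to shared integer temporal positions, so tokens from different modalities that occur at the same moment line up. The function name and the 40 ms granularity here are illustrative assumptions, not Qwen3.5-Omni's actual implementation.

```python
# Illustrative sketch of time-aligned temporal position IDs (TMRoPE-style).
# The 40 ms-per-position granularity is an assumption for illustration.

def temporal_ids(timestamps_s, ms_per_position=40):
    """Map real timestamps (in seconds) to integer temporal positions."""
    return [int(t * 1000 // ms_per_position) for t in timestamps_s]

# One second of audio chunked every 40 ms, and two video frames:
audio_ts = [i * 0.04 for i in range(25)]
video_ts = [0.0, 0.5]

audio_pos = temporal_ids(audio_ts)
video_pos = temporal_ids(video_ts)

# Tokens from different modalities at the same moment share a position,
# letting attention align "what was said" with "what was shown".
print(audio_pos[:5])   # [0, 1, 2, 3, 4]
print(video_pos)       # [0, 12]
```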

Scenario Implementation: From Vibe Coding to Voice Cloning

The evolution of Qwen3.5-Omni has directly translated into several groundbreaking application scenarios:

  1. Natural Emergent Vibe Coding: Without specialized training, the model shows remarkable code comprehension and generation, producing Python code or front-end prototypes directly from the logic shown in a video.

  2. Human-like Real-time Interaction: Supports semantic interruption, distinguishing background noise such as coughing from genuine interruptions, and lets users adjust tone (e.g., "happy") and volume through spoken instructions.

  3. Fine-grained Video Decomposition: Can generate time-stamped structured captions, accurately identifying actions, background music changes, and camera transitions in videos.

  4. Personalized Voice Cloning: Users need only upload a short recording to create a highly natural, personalized "digital avatar," with support for 113 languages.
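
A time-stamped structured caption of the kind described in scenario 3 might look like the following. The schema and field names ("start", "end", "event", "text") are hypothetical illustrations, not the model's real output format; the sketch also checks that segments tile the clip without gaps.

```python
# Hypothetical time-stamped structured caption; schema is an assumption.
caption = [
    {"start": "00:00.0", "end": "00:04.2", "event": "action",
     "text": "A cyclist enters the frame from the left."},
    {"start": "00:04.2", "end": "00:09.8", "event": "music",
     "text": "Background music shifts from piano to strings."},
    {"start": "00:09.8", "end": "00:10.1", "event": "camera",
     "text": "Hard cut to an overhead drone shot."},
]

def to_seconds(ts):
    """Parse an 'MM:SS.s' timestamp into seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

# Segments should tile the clip: each one starts where the previous ended.
for prev, cur in zip(caption, caption[1:]):
    assert to_seconds(prev["end"]) == to_seconds(cur["start"])
```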

Qwen3.5-Omni is now available on the Alibaba Cloud BaiLian platform in Plus, Flash, and Light versions, and a real-time dialogue (Realtime) API and Demo are available in the ModelScope community.
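
For orientation, a multimodal chat request in the OpenAI-compatible message format that Alibaba Cloud's DashScope service exposes might be assembled as below. The model name "qwen3.5-omni-flash", the media URL, and the exact content-part shapes are assumptions for illustration; consult the BaiLian documentation for the real identifiers and payload schema.

```python
# Sketch of an OpenAI-compatible multimodal chat payload.
# Model name and media fields are illustrative assumptions.

def build_request(text, image_url=None, audio_url=None,
                  model="qwen3.5-omni-flash"):
    """Assemble a chat request mixing text with optional image/audio parts."""
    content = [{"type": "text", "text": text}]
    if image_url:
        content.append({"type": "image_url",
                        "image_url": {"url": image_url}})
    if audio_url:
        content.append({"type": "input_audio",
                        "input_audio": {"data": audio_url, "format": "wav"}})
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}

req = build_request("What is happening in this clip?",
                    image_url="https://example.com/frame.jpg")
print(req["model"])
```

To actually send it, the dict would be passed to an OpenAI-compatible client pointed at DashScope's compatible-mode base URL.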