Tongyi Lab officially launched its new multimodal large model, Qwen3.5-Omni, last night. Compared with its predecessor, the model delivers a leap in comprehension, interaction, and task-execution capability, marking AI's transition from "a screen-based assistant" to "an intelligent agent that understands the physical world."

Core Breakthroughs: Full Modalities and 215 SOTA

Qwen3.5-Omni adopts a native "Full Modality" architecture that seamlessly processes text, image, audio, and video inputs. Across tests covering audio-visual analysis, reasoning, dialogue, and translation, the model achieved 215 SOTA (state-of-the-art) results. In general audio understanding and recognition in particular, it surpasses Gemini-3.1 Pro, while its visual and text capabilities remain top-tier, matching the same-size Qwen3.5 model.


Technical Deep Dive: Hybrid-Attention MoE Architecture

The model retains the classic Thinker-Talker division of labor while fundamentally restructuring both components:

  • Thinker (Understanding Center): Upgraded to a Hybrid-Attention MoE supporting a 256K ultra-long context. This lets it process up to 10 hours of audio or 1 hour of video, accurately capturing fine-grained information in long sequences via TMRoPE (time-aligned multimodal rotary position embedding).

  • Talker (Expression Center): Introduces the new ARIA technique and RVQ (residual vector quantization) coding in place of heavy DiT (diffusion transformer) computation. This not only fixes common speech-output issues such as dropped words and misread numbers, but also gives the model strong real-time voice-control capabilities.
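
The time-alignment idea behind TMRoPE can be illustrated with a small sketch: audio chunks and video frames are both mapped from their real timestamps to shared integer temporal positions, so tokens from different modalities that occur at the same moment line up. The function name and the 40 ms granularity here are illustrative assumptions, not Qwen3.5-Omni's actual implementation.

```python
# Illustrative sketch of time-aligned temporal position IDs (TMRoPE-style).
# The 40 ms-per-position granularity is an assumption for illustration.

def temporal_ids(timestamps_s, ms_per_position=40):
    """Map real timestamps (in seconds) to integer temporal positions."""
    return [int(t * 1000 // ms_per_position) for t in timestamps_s]

# One second of audio chunked every 40 ms, and two video frames:
audio_ts = [i * 0.04 for i in range(25)]
video_ts = [0.0, 0.5]

audio_pos = temporal_ids(audio_ts)
video_pos = temporal_ids(video_ts)

# Tokens from different modalities at the same moment share a position,
# letting attention align "what was said" with "what was shown".
print(audio_pos[:5])   # [0, 1, 2, 3, 4]
print(video_pos)       # [0, 12]
```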

Scenario Implementation: From Vibe Coding to Voice Cloning

The evolution of Qwen3.5-Omni has directly translated into several groundbreaking application scenarios:

  1. Natural Emergent Vibe Coding: Without specialized training, the model shows remarkable code comprehension and generation, producing Python code or front-end prototypes directly from the logic shown in a video.

  2. Human-like Real-time Interaction: Supports semantic interruption, distinguishing background noise such as coughing from genuine interruptions, and lets users adjust tone (e.g., "happy") and volume through spoken instructions.

  3. Fine-grained Video Decomposition: Can generate time-stamped structured captions, accurately identifying actions, background music changes, and camera transitions in videos.

  4. Personalized Voice Cloning: Users need only upload a short recording to create a highly natural, personalized "digital avatar," with support for 113 languages.
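
A time-stamped structured caption of the kind described in scenario 3 might look like the following. The schema and field names ("start", "end", "event", "text") are hypothetical illustrations, not the model's real output format; the sketch also checks that segments tile the clip without gaps.

```python
# Hypothetical time-stamped structured caption; schema is an assumption.
caption = [
    {"start": "00:00.0", "end": "00:04.2", "event": "action",
     "text": "A cyclist enters the frame from the left."},
    {"start": "00:04.2", "end": "00:09.8", "event": "music",
     "text": "Background music shifts from piano to strings."},
    {"start": "00:09.8", "end": "00:10.1", "event": "camera",
     "text": "Hard cut to an overhead drone shot."},
]

def to_seconds(ts):
    """Parse an 'MM:SS.s' timestamp into seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

# Segments should tile the clip: each one starts where the previous ended.
for prev, cur in zip(caption, caption[1:]):
    assert to_seconds(prev["end"]) == to_seconds(cur["start"])
```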

Qwen3.5-Omni is now available on the Alibaba Cloud BaiLian platform in Plus, Flash, and Light versions, and a real-time dialogue (Realtime) API and Demo are available in the ModelScope community.
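
For orientation, a multimodal chat request in the OpenAI-compatible message format that Alibaba Cloud's DashScope service exposes might be assembled as below. The model name "qwen3.5-omni-flash", the media URL, and the exact content-part shapes are assumptions for illustration; consult the BaiLian documentation for the real identifiers and payload schema.

```python
# Sketch of an OpenAI-compatible multimodal chat payload.
# Model name and media fields are illustrative assumptions.

def build_request(text, image_url=None, audio_url=None,
                  model="qwen3.5-omni-flash"):
    """Assemble a chat request mixing text with optional image/audio parts."""
    content = [{"type": "text", "text": text}]
    if image_url:
        content.append({"type": "image_url",
                        "image_url": {"url": image_url}})
    if audio_url:
        content.append({"type": "input_audio",
                        "input_audio": {"data": audio_url, "format": "wav"}})
    return {"model": model,
            "messages": [{"role": "user", "content": content}]}

req = build_request("What is happening in this clip?",
                    image_url="https://example.com/frame.jpg")
print(req["model"])
```

To actually send it, the dict would be passed to an OpenAI-compatible client pointed at DashScope's compatible-mode base URL.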