China's large models are moving from "following" to "leading" in multimodal interaction.

On March 30, Alibaba officially launched its new-generation multimodal large model, Qwen3.5-Omni. Built on a hybrid-attention MoE (mixture-of-experts) architecture, the model handles input and output seamlessly across images, video, audio, and text, a sign that Chinese large models have reached the global top tier in audio-visual interaction.

All-Round Strength: SOTA on 215 Tasks, Outperforming Gemini

On the hard metrics that measure a large model's overall strength, Qwen3.5-Omni posted dominant results:

SOTA Dominance: Across 215 test tasks covering audio-video understanding, recognition, and interaction, the model achieved SOTA (state-of-the-art) results.

Superior Performance: On tests that focus on audio-visual interaction, such as DailyOmni and QualcommInteractive, it scored significantly higher than Google's Gemini-3.1Pro.

Exceptional Interference Resistance: On the WenetSpeech test in noisy environments, its recognition accuracy was very high, with an error rate far lower than that of competing models.
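
For readers unfamiliar with how a speech recognition "error rate" is usually computed, the following is a minimal sketch. It assumes the common edit-distance definition (substitutions, deletions, and insertions divided by the reference length). The article does not state which metric or tokenization was used for the WenetSpeech test, so the function names and sample sentences here are purely illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing
                curr[j - 1] + 1,     # insertion: extra hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length (lower is better)."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Hypothetical transcript pair, not taken from any benchmark.
reference = "turn on the living room light".split()
hypothesis = "turn on living room lights".split()
rate = error_rate(reference, hypothesis)
print(f"error rate: {rate:.2%}")
```

For Chinese speech the same calculation is often applied to characters rather than words, which would mean passing `list(text)` instead of `text.split()`.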

Interaction Revolution: 113 Languages Recognized and "Speak-to-Code" Programming

Qwen3.5-Omni is not only smarter but also better at understanding "dialects" and "code":

Language Expertise: It recognizes 113 languages and dialects, and accurately captures even less common ones such as Maori and the Hainan dialect.

Vibe Coding Evolution: It opens a new era of audio-video programming. Users simply turn on the camera, show a sketch, and describe what they need aloud; the model then directly generates a product prototype interface with complex UI, truly achieving "what you say is what you get."

Productivity Explosion: Long-Form Understanding of 10-Hour Audio

For professional use, the new model provides strong structured processing capabilities:

Deep Video Analysis: It can break down in fine detail the main subject on screen, the relationships between people, and shifts in emotion.

Automatic Segmentation: It supports more than 10 hours of audio input and can automatically split video into chapters and annotate timestamps, greatly improving content creation efficiency.

Accessible Ecosystem: Price Is Just One-Tenth of Gemini's

The Aliyun BaiLian platform has simultaneously launched three API versions (Plus, Flash, and Light), aiming to give enterprises the most cost-effective choice:

Very Low Cost: Input costs less than 0.8 yuan per million tokens, under one-tenth the price of Gemini-3.1Pro.

Market Leadership: Qwen has now served more than one million customers and remains the top choice in China's enterprise large-model API call market.
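
The input pricing quoted above can be turned into a rough budget estimate. The sketch below assumes the article's upper bound of 0.8 yuan per million input tokens. The function name and the monthly token volume are hypothetical, and output-token pricing is not given in the article, so it is left out.

```python
# Upper bound on input pricing quoted in the article (yuan per million tokens).
PRICE_PER_MILLION_INPUT_TOKENS_YUAN = 0.8


def input_cost_upper_bound_yuan(input_tokens,
                                price_per_million=PRICE_PER_MILLION_INPUT_TOKENS_YUAN):
    """Return an upper bound on input cost in yuan for a given token count."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return input_tokens / 1_000_000 * price_per_million


# Hypothetical workload: 500 million input tokens per month.
monthly_tokens = 500_000_000
cost = input_cost_upper_bound_yuan(monthly_tokens)
print(f"input cost is at most {cost:.2f} yuan per month")
```

Because the article only says the price is "less than" 0.8 yuan, the result is a ceiling rather than an exact bill, and actual rates may differ across the Plus, Flash, and Light versions.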

Conclusion: From "Understanding Text" to "Perceiving the World"

The release of Qwen3.5-Omni