215 SOTA Wins! Alibaba Releases Qwen3.5-Omni: Outstanding Cross-Modal Performance Exceeding Gemini

China's large models are moving from "following" to "leading" in multimodal interaction.

On March 30, Alibaba officially launched Qwen3.5-Omni, its new-generation multimodal large model. Built on a hybrid-attention MoE architecture, the model handles images, video, audio, and text as both input and output, a sign that Chinese large models have reached the global top tier in audio-visual interaction.

All-Round Strength: 215 Task Wins, Surpassing Gemini

On the benchmarks used to measure the overall capability of large models, Qwen3.5-Omni performs strongly:

SOTA Dominance: Across 215 test tasks covering audio-video understanding, recognition, and interaction, the model achieved state-of-the-art (SOTA) results.

Superior Performance: On audio-visual interaction benchmarks such as DailyOmni and QualcommInteractive, it scored significantly higher than Google's Gemini-3.1Pro.

Exceptional Noise Robustness: On the WenetSpeech test under noisy conditions, its recognition accuracy stayed very high, with an error rate far lower than that of competing models.

Interactive Revolution: 113 Languages Recognized and "Speak-to-Code" Programming

Qwen3.5-Omni is not only smarter but also better at understanding "dialects" and "code":

Language Expertise: It recognizes 113 languages and dialects, including less common ones such as Maori and the Hainan dialect.

Vibe Coding Evolution: It opens a new era of audio-visual programming. Users only need to turn on the camera, point it at a sketch, and describe what they want; the model can then directly generate a product prototype interface with a complex UI, truly achieving "what you say is what you get."

Productivity Boost: Long-Form Understanding of 10-Hour Audio

For professional fields, the new model provides strong structured processing capabilities:

Deep Video Analysis: It can break down, in fine detail, the main subject of each frame, the relationships between people, and shifts in emotion.

Automatic Segmentation: It accepts more than 10 hours of audio input and can automatically segment videos into chapters and annotate timestamps, greatly improving content creation efficiency.

Inclusive Ecosystem: Priced at Less Than One-Tenth of Gemini

Alibaba Cloud's BaiLian platform has launched three APIs at the same time, Plus, Flash, and Light, aiming to give enterprises the most cost-effective choice:

Very Low Cost: Input costs less than 0.8 yuan per million tokens, less than one-tenth the price of Gemini-3.1Pro.

Market Leadership: Qwen has served more than one million customers to date and remains the top choice in China's enterprise large-model API call market.
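As a rough illustration of the pricing figure above, the following sketch estimates input cost from a token count. It treats the quoted 0.8 yuan per million input tokens as an upper bound; the helper name and the 50-million-token workload are hypothetical, and actual per-tier pricing for the Plus, Flash, and Light APIs is not given in this article and may differ.

```python
# Upper bound on the input price quoted in the article (yuan per million tokens).
# Assumption: a single flat rate; real per-tier pricing may differ.
PRICE_PER_MILLION_INPUT_TOKENS_YUAN = 0.8


def estimate_input_cost_yuan(
    input_tokens: int,
    price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS_YUAN,
) -> float:
    """Return an estimated input cost in yuan (hypothetical helper)."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return input_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical workload: 50 million input tokens.
    print(f"{estimate_input_cost_yuan(50_000_000):.2f} yuan")
```

Under these assumptions, 50 million input tokens would cost at most about 40 yuan.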

Conclusion: From "Understanding Text" to "Perceiving the World"

The release of Qwen3.5-Omni

