Recently, the Tongyi Qwen team of Alibaba officially released its latest version, Qwen3-Omni-Flash-2025-12-01. This upgraded version is built upon Qwen3-Omni and serves as a new generation of native multimodal large model, capable of efficiently processing various input formats such as text, images, audio, and video, achieving real-time streaming responses and generating text and natural speech outputs.

image.png

The main highlights of this upgrade include an overall enhancement in audio and video interaction experience. This version significantly improves the understanding and execution capabilities for audio and video instructions, effectively solving the "dumbing down" issues commonly found in conversational scenarios. The stability and coherence of multi-turn audio and video conversations have been enhanced, making human-computer interaction more natural and smooth.

In addition, the system prompt control capabilities have made significant progress. Users can fully customize the system prompt, finely control the model's behavior. Whether it's role style, spoken expression preferences, or response length requirements, they can be precisely achieved, enhancing the model's controllability.

In terms of multilingual processing capabilities, the new version supports 119 text languages, 19 speech recognition languages, and 10 speech synthesis languages. Compared to previous versions, Qwen3-Omni-Flash has comprehensively optimized language adherence stability, ensuring the accuracy of responses in cross-language scenarios.

The performance of speech generation has also become more human-like and fluent. The new version effectively solves the problems of slow speaking speed and mechanical feel, improving the model's ability to adaptively adjust speaking speed, pauses, and rhythm based on the content of the text, making the speech output closer to real conversation.

In objective performance metrics, the full-modal capabilities of Qwen3-Omni-Flash-2025-12-01 have been significantly improved. Text comprehension and generation capabilities, speech understanding accuracy, speech generation naturalness, and image comprehension depth have all surpassed previous versions, providing users with an unprecedented natural, accurate, and vivid AI interaction experience.

Key Points:

🌟 The new version of Qwen3-Omni-Flash enhances the audio and video interaction experience, improving the understanding and execution capabilities for audio and video instructions.  

🌍 The system prompt customization function is fully open, allowing users to finely control the model's behavior and enhance the personalization of interactions.  

💬 Multilingual support capabilities are optimized, ensuring the accuracy and consistency of responses in cross-language scenarios.