The Alibaba Qwen team officially open-sourced the Qwen3-TTS series of speech generation models last night. The release quickly swept through the open-source community and is seen as a significant breakthrough in text-to-speech synthesis. The series adopts an end-to-end architecture, supporting voice cloning from just a few seconds of audio, natural-language voice design, and real-time streaming output, greatly lowering the barrier to real-time applications.


Dual-Track Architecture Achieves Ultra-Low Latency

The core innovation of Qwen3-TTS is its Dual-Track hybrid streaming generation mechanism, combined with a discrete multi-codebook language model that models speech end-to-end and avoids the information bottleneck of traditional cascaded architectures (such as LM+DiT). In practical testing, end-to-end latency is as low as 97 ms, and the first audio packet can be emitted after only a single input character. This responsiveness makes it particularly well suited to latency-sensitive scenarios such as live interaction, real-time translation, and AI customer service.
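To make the streaming behavior concrete, here is a minimal sketch of what incremental playback could look like. The package, class, and method names (`qwen3_tts`, `Qwen3TTS.from_pretrained`, `stream`, the checkpoint id) are illustrative assumptions, not the published API; only the playback side uses the real `sounddevice` library.

```python
# Illustrative sketch only: qwen3_tts, Qwen3TTS, stream(), and the checkpoint id
# are assumed names for the purpose of this example, not the official interface.
import sounddevice as sd            # real library, used here as an audio sink
from qwen3_tts import Qwen3TTS      # hypothetical package name

model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-1.7B")  # hypothetical checkpoint id

# In a streaming setup, audio chunks are yielded as soon as the first packet
# is ready, instead of waiting for the whole utterance to be synthesized.
with sd.OutputStream(samplerate=24000, channels=1) as speaker:
    for audio_chunk in model.stream(text="Hello, this is a streaming demo."):
        speaker.write(audio_chunk)  # play each small packet as it arrives (float32 frames)
```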

3-Second Rapid Cloning + Zero-Loss Migration Across Languages/Dialects

The voice cloning capability is especially impressive: just 3 seconds of reference audio is enough for high-fidelity zero-shot voice cloning. The cloned voice supports seamless cross-lingual transfer across 10 mainstream languages: a Chinese voice can directly speak English, Japanese, Korean, German, French, Russian, Spanish, Portuguese, and Italian while retaining the original voice characteristics. It can also naturally produce various Chinese dialects, such as Sichuanese and the Beijing dialect, with accurate accents and expressions, opening up new possibilities for multilingual content creation and localized applications.
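A sketch of how zero-shot cloning and cross-lingual reuse might look in code. Again, the `qwen3_tts` package and the `clone`/`synthesize` methods are assumptions for illustration; `soundfile` is a real library used only to save the output.

```python
# Illustrative sketch; clone() and synthesize() are assumed method names.
import soundfile as sf               # real library, for saving the result
from qwen3_tts import Qwen3TTS       # hypothetical package name

model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-1.7B")  # hypothetical checkpoint id

# A ~3 s reference clip is claimed to be enough for zero-shot cloning.
voice = model.clone(reference_audio="speaker_ref_3s.wav")

# The same cloned voice can then be reused across languages while keeping its timbre.
audio_en = model.synthesize("Good morning, everyone.", voice=voice)
audio_ja = model.synthesize("おはようございます。", voice=voice)

sf.write("cloned_en.wav", audio_en, 24000)
sf.write("cloned_ja.wav", audio_ja, 24000)
```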

Design a New Voice with Just One Sentence

Beyond cloning, Qwen3-TTS also offers a powerful Voice Design feature that lets users customize voices through natural language instructions, such as "tell a story with a gentle and encouraging mature female voice" or "a high-pitched and excited young male voice explaining a game." The model automatically adjusts tone, emotion, and rhythm to generate highly personalized delivery. This "what you think is what you hear" control is especially useful in audiobook production: a single narrator can voice multiple roles, handling emotional shifts and dialect changes, which greatly improves immersion and productivity.
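A sketch of instruction-driven voice design under the same assumptions as above; `design()` is an assumed entry point for the VoiceDesign variant, and the checkpoint id is hypothetical.

```python
# Illustrative sketch; design() and the VoiceDesign checkpoint id are assumptions.
from qwen3_tts import Qwen3TTS       # hypothetical package name

model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-VoiceDesign-1.7B")  # hypothetical id

# Describe the target voice in plain language instead of providing reference audio.
voice = model.design("a gentle, encouraging mature female voice, slow storytelling pace")

audio = model.synthesize("Once upon a time, in a quiet village...", voice=voice)
```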

Two Sizes, 1.7B and 0.6B: Choose Freely Between Performance and Efficiency

The Qwen3-TTS family provides two parameter sizes:

- 1.7B model: highest performance, strong control capabilities, suitable for cloud scenarios where high audio quality and expressiveness are required;

- 0.6B model: achieves better inference efficiency and lower resource consumption while maintaining excellent synthesis quality, suitable for edge devices or high-concurrency deployments.

The team has open-sourced the complete series (including the Base, VoiceDesign, and CustomVoice variants) on GitHub and Hugging Face, with support for full-parameter fine-tuning, so developers can easily build brand-specific voice identities.

With the open-sourcing of Qwen3-TTS, the barriers to real-time, personalized, and multilingual speech AI have been significantly lowered. Content creators, developers, and enterprise applications all stand to benefit from a new wave of voice interaction.

Project Address: https://github.com/QwenLM/Qwen3-TTS