Recently, the Qwen team at Alibaba Cloud released two new artificial intelligence models for generating and cloning voices. The first, Qwen3-TTS-VD-Flash, generates voices from detailed text descriptions, letting users precisely define characteristics such as emotion and speaking rhythm.

For example, users can request a "middle-aged man with a loud baritone voice: an energetic advertisement narrator with a fast speech rate, exaggerated intonation changes, and a persuasive sales delivery." According to the team, this model outperforms OpenAI's recently launched gpt-4o-mini-tts API.

The second model, Qwen3-TTS-VC-Flash, can clone a voice from just three seconds of audio and reproduce it in ten languages. Qwen claims this model's error rate is lower than that of competitors such as ElevenLabs and MiniMax.

In addition, the model can handle complex texts, imitate animal sounds, and extract voices from recordings. Both models are accessible via Alibaba Cloud's API, and demos of the voice-design and voice-cloning models are available on the Hugging Face platform.
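To make the API access concrete, the following Python sketch shows roughly what a request to the voice-design model could look like. The endpoint URL, environment variable, request fields, and response format are assumptions for illustration only and are not taken from Alibaba Cloud's documentation; consult the official API reference for the actual interface. The cloning model would presumably take a short reference audio clip in place of the text description.

```python
# Hypothetical sketch of calling the Qwen3-TTS-VD-Flash voice-design model.
# Endpoint URL, field names, and response shape are assumptions, not the
# documented Alibaba Cloud API.
import os

import requests

API_KEY = os.environ.get("ALIBABA_CLOUD_API_KEY", "your-api-key")   # assumed variable name
ENDPOINT = os.environ.get("QWEN_TTS_ENDPOINT", "https://example.com/v1/tts")  # placeholder URL

payload = {
    "model": "Qwen3-TTS-VD-Flash",
    "text": "Welcome to our summer sale, everything must go!",
    # Voice description taken from the example quoted in the article.
    "voice_description": (
        "Middle-aged man with a loud baritone voice: an energetic "
        "advertisement narrator with a fast speech rate, exaggerated "
        "intonation changes, and a persuasive sales delivery."
    ),
}

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# Assume the service returns raw audio bytes and save them to disk.
with open("ad_narrator.wav", "wb") as f:
    f.write(response.content)
```

This is only a shape sketch: the real service may return a JSON envelope with an audio URL rather than raw bytes, and authentication details may differ.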

Key Points:  

🌟 The new Qwen models support generating voices from text descriptions and cloning voices from short audio samples.  

🎤 Qwen3-TTS-VC-Flash can clone a voice from just three seconds of audio and supports ten languages.  

🚀 The models reportedly outperform competitors and are suited to handling complex texts and voice imitation.