Qwen-TTS, a text-to-speech model, has been updated, and the latest version is now available through the Qwen API, giving users a richer text-to-speech experience.
This update adds support for three Chinese dialects: Beijing, Shanghai, and Sichuan, which broadens the model's application scenarios. The model is trained on a large-scale corpus of more than 3 million hours, and its synthesized speech is reported to reach human-level naturalness and expressiveness. Beyond rendering text accurately, Qwen-TTS automatically adjusts intonation, rhythm, and emotional tone to match the input text.

Qwen-TTS currently supports seven Chinese and English voices. These include standard voices such as Cherry and Ethan, as well as three dialect voices: Dylan (Beijing dialect), Jada (Shanghai dialect), and Sunny (Sichuan dialect). Users can choose whichever voice suits their synthesis needs.
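As an illustration of choosing a voice, the mapping above can be written as a small lookup. This is only a minimal sketch: the voice names come from this article, but the helper names (pick_voice, build_request), the lowercase dialect keys, and the "text" and "voice" field names are hypothetical and are not taken from the Qwen API. Consult the Model Studio documentation for the real request format.

```python
from typing import Optional

# Voice names are the ones listed in the article.
# The lowercase dialect keys are illustrative labels, not API values.
DIALECT_VOICES = {
    "beijing": "Dylan",
    "shanghai": "Jada",
    "sichuan": "Sunny",
}


def pick_voice(dialect: Optional[str] = None, default: str = "Cherry") -> str:
    """Return the dialect voice for `dialect`, or a standard voice if none is given."""
    if dialect is None:
        return default
    key = dialect.strip().lower()
    if key not in DIALECT_VOICES:
        raise ValueError(f"No dialect voice known for {dialect!r}")
    return DIALECT_VOICES[key]


def build_request(text: str, dialect: Optional[str] = None) -> dict:
    """Assemble synthesis parameters. The field names here are assumptions."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"text": text, "voice": pick_voice(dialect)}
```

Calling build_request with some text and "sichuan" selects Sunny, while omitting the dialect falls back to the standard voice Cherry. Rejecting unknown dialects with an error, rather than silently falling back, keeps a typo from producing speech in the wrong voice.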
In practice, Qwen-TTS handles both everyday scenes and complex emotions, producing natural, fluent speech. For example, the Beijing dialect voice Dylan reads a passage about childhood games with playful, childlike energy, while the Shanghai dialect voice Jada gives a dialogue about daily life an authentic Shanghai flavor.
The Qwen-TTS development team says it will continue to optimize the model's performance and plans to add more languages and voice styles to meet users' increasingly diverse needs. The team also provides an API so that developers can easily integrate Qwen-TTS into their own applications.
Model Studio: https://help.aliyun.com/zh/model-studio/qwen-tts
