Speech synthesis technology is making a qualitative leap from "mechanical repetition" to "emotional resonance." On March 19, Xiaomi officially launched Xiaomi MiMo-V2-TTS, a large speech synthesis model developed in-house. It is more than a tool that lets machines "speak": Xiaomi positions it as a "versatile voice actor" that can act, speak, and sing.

MiMo-V2-TTS is built on Xiaomi's in-house Audio Tokenizer and a multi-codebook architecture that jointly models speech and text. After pre-training on hundreds of millions of hours of speech data, it offers speech style control at multiple levels of granularity:
Emotion Master: The model supports fine-grained adjustment, from the overall tone of an utterance down to local emotions. It can shift tone and subtly change emotion within a single sentence, reproducing the natural rhythm of human speech.
Crossover Singer: Beyond speech, the model offers high-quality singing synthesis, rendering pitch and rhythm accurately with natural, expressive vocal technique.
Dialect Expert: To serve users in different regions, the model supports multiple dialects and accents, including Northeastern, Sichuan, Henan, Cantonese, and Taiwanese, and can deliver them in a specific character or style.
Notably, MiMo-V2-TTS greatly reduces the effort required of users. It recognizes punctuation, modal particles, and emphasis markers in the input text and automatically converts them into appropriate vocal expression, with no additional annotation or manual intervention needed at any point.
For Xiaomi, the release of this model marks a key milestone in its speech technology roadmap. Going forward, Xiaomi plans to extend the model to more languages beyond Chinese and English, and to integrate it deeply with the multimodal understanding capabilities of MiMo-V2-Omni.
When AI agents can not only understand the world but also describe it in a compelling, human-like voice, the future of human-computer interaction comes into focus. With the rollout of MiMo-V2-TTS, smart devices in the Xiaomi ecosystem will no longer be cold terminals, but more "human-like" digital companions.