Xiaomi has officially released its self-developed large-scale speech synthesis model, MiMo-V2-TTS, marking significant progress in highly controllable, expressive speech generation. The model is built on Xiaomi's in-house Audio Tokenizer and a multi-codebook joint speech-text modeling architecture. Pre-trained on hundreds of millions of hours of speech data, it offers precise control over everything from macro-level style down to micro-level emotional detail.

Unlike traditional TTS systems, MiMo-V2-TTS can shift tone and emotion within a single sentence, faithfully reproducing the natural rhythm of human speech, and it supports song synthesis with accurate pitch and rhythm. On the technical side, Xiaomi introduced multi-dimensional reinforcement learning to balance the stability and expressiveness of the generated audio. The model intelligently recognizes textual cues such as punctuation, modal particles, and emphasis markers, converting them into appropriate vocal expression without additional manual annotation. It also shows strong cross-regional adaptability, supporting dialects including Northeastern Mandarin, Sichuanese, Henanese, Cantonese, and Taiwanese-accented Mandarin, and can deliver in-character voice performances.
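Xiaomi has not published MiMo-V2-TTS's internals, but "multi-codebook" audio tokenizers in the research literature are typically built on residual vector quantization (RVQ), where each codebook quantizes the residual left over by the previous one. The minimal sketch below illustrates that general idea; the codebook count, codebook size, and embedding dimension are illustrative assumptions, not the model's actual configuration.

```python
import torch

# Hypothetical parameters -- MiMo-V2-TTS's real configuration is unpublished.
# This is a generic residual vector quantization (RVQ) sketch, the standard
# way "multi-codebook" audio tokenizers discretize speech frames.
NUM_CODEBOOKS = 4      # number of residual codebooks (assumed)
CODEBOOK_SIZE = 1024   # entries per codebook (assumed)
DIM = 256              # frame embedding dimension (assumed)

torch.manual_seed(0)
# Random codebooks stand in for learned ones.
codebooks = [torch.randn(CODEBOOK_SIZE, DIM) for _ in range(NUM_CODEBOOKS)]

def rvq_encode(frames: torch.Tensor) -> torch.Tensor:
    """Quantize (T, DIM) frame embeddings into (T, NUM_CODEBOOKS) token ids.

    Each codebook quantizes the residual left by the previous one, so
    later codebooks capture progressively finer acoustic detail.
    """
    residual = frames
    ids = []
    for cb in codebooks:
        # Nearest codebook entry for every frame (L2 distance).
        dists = torch.cdist(residual, cb)   # (T, CODEBOOK_SIZE)
        idx = dists.argmin(dim=-1)          # (T,)
        ids.append(idx)
        residual = residual - cb[idx]       # pass the residual onward
    return torch.stack(ids, dim=-1)         # (T, NUM_CODEBOOKS)

frames = torch.randn(50, DIM)               # 50 dummy speech frames
tokens = rvq_encode(frames)
print(tokens.shape)                          # torch.Size([50, 4])
```

In a scheme like this, the coarse first codebook carries broad prosodic and stylistic information while later codebooks add fine acoustic detail, which fits the claim of control ranging from macro-level style to micro-level emotional nuance.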
As a key milestone on Xiaomi's voice technology roadmap, MiMo-V2-TTS will further expand its multilingual coverage and integrate deeply with the multimodal understanding capabilities of MiMo-V2-Omni. This evolution from standalone speech synthesis toward coordinated multimodal perception and expression signals that AI agents are moving beyond simple semantic interaction to more personable, emotionally resonant human-computer interaction, significantly improving the user experience in scenarios such as smart cockpits and smart homes.

