When AI speech is no longer just able to listen and speak, but can orchestrate emotion, tone, and even character identity with the precision of a director, human-computer interaction enters a new level of naturalness. Xiaomi has officially launched the MiMo-V2.5 Full-Stack Speech Model Series: three TTS (Text-to-Speech) models and one open-source ASR (Automatic Speech Recognition) model. Together they cover both voice input and voice output for the Agent era, turning sound into a programmable, creative, and reproducible intelligent medium.

🎙️ Three TTS Models: Voice Now "Under Your Command"
Xiaomi's newly released MiMo-V2.5-TTS series realizes, for the first time, a "language as control" paradigm in speech generation:
- MiMo-V2.5-TTS: Ships with multiple high-fidelity premium voices and supports fine-grained control of speed, emotion, and tone through natural-language instructions. Users don't fill in parameters; they simply describe the delivery as if directing an actor: "Speak in a gentle but firm tone, slightly slower, with a hint of fatigue," and the model performs it accurately.
- MiMo-V2.5-TTS-VoiceDesign: Generates a brand-new voice from a single sentence. Input "a 30-year-old intellectual female voice with a slight southern accent, suited to financial news broadcasting," and the system immediately creates a personalized voice, dramatically lowering the barrier to voice creation.
- MiMo-V2.5-TTS-VoiceClone: From only a small sample (e.g., 30 seconds of audio), it replicates the target voice with high fidelity while retaining the ability to respond to style instructions and audio tags, making it suitable for virtual anchors, personalized assistants, and similar scenarios.
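The three modes above can be pictured as variations on one request shape. The sketch below is purely illustrative: the helper `tts_request` and every field name (`voice`, `instruction`, `voice_description`, `reference_audio`) are hypothetical assumptions, not the platform's documented API.

```python
# Hypothetical request payloads for the three MiMo-V2.5 TTS modes.
# All field names are illustrative assumptions, not a documented API.

def tts_request(text, **kwargs):
    """Assemble a request dict for a hypothetical TTS endpoint."""
    return {"model": kwargs.pop("model"), "text": text, **kwargs}

# 1. Preset voice, steered by a natural-language style instruction.
req_preset = tts_request(
    "Welcome back. Let's pick up where we left off.",
    model="MiMo-V2.5-TTS",
    voice="preset_female_01",                # hypothetical preset id
    instruction="gentle but firm, slightly slower, a hint of fatigue",
)

# 2. Voice designed from a one-sentence description.
req_design = tts_request(
    "Markets opened higher this morning.",
    model="MiMo-V2.5-TTS-VoiceDesign",
    voice_description="30-year-old intellectual female voice, "
                      "slight southern accent, financial-news style",
)

# 3. Voice cloned from a short reference sample.
req_clone = tts_request(
    "Hello everyone, and welcome to today's stream.",
    model="MiMo-V2.5-TTS-VoiceClone",
    reference_audio="samples/host_30s.wav",  # ~30 s clip, hypothetical path
    instruction="upbeat, broadcast pacing",
)
```

The point of the sketch is that all three modes share one control surface: plain-language style instructions, with the voice itself supplied as a preset, a description, or a reference clip.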
An even more groundbreaking feature is its layered script mechanism. In scenarios demanding high consistency, such as audiobooks or game NPCs, developers can separately define a character identity layer, a scene atmosphere layer, and per-line performance guidance. Each layer can be updated independently yet works with the others, keeping a character's voice consistent throughout while still allowing each line of dialogue to vary.
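The layering idea can be sketched as plain data merged per line. The structure below is an assumption for illustration; the model's actual script format is not public.

```python
# Sketch of the layered-script idea: character identity, scene atmosphere,
# and per-line guidance are kept as separate layers and merged per line.
# The dict layout and build_line_prompt are illustrative, not the real format.

def build_line_prompt(character: dict, scene: dict, line: dict) -> str:
    """Merge the three layers into one directing prompt for a single line."""
    parts = [
        f"Character: {character['identity']}",
        f"Scene: {scene['atmosphere']}",
        f"Direction: {line.get('guidance', 'natural delivery')}",
        f"Text: {line['text']}",
    ]
    return "\n".join(parts)

character = {"identity": "a weary veteran innkeeper, low gravelly voice"}
scene = {"atmosphere": "stormy night, hushed common room"}
lines = [
    {"text": "We're full tonight.", "guidance": "curt, dismissive"},
    {"text": "Though there's room by the fire.", "guidance": "softening, reluctant"},
]

prompts = [build_line_prompt(character, scene, ln) for ln in lines]
```

Editing one layer (say, the scene) ripples into every subsequent line without touching the character definition, which is exactly the consistency property the layered mechanism is meant to provide.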
In addition, the model supports inline audio tags (such as [emotion: excited]), which can be inserted at any position in the text and combined to choreograph complex emotional arcs. Even when the input is plain text with no hints at all, the model automatically infers emotion from punctuation and sentence structure, producing vivid, lifelike speech.
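To make the inline-tag idea concrete, here is a minimal parser for tags of the shape `[key: value]`, generalized from the one example tag shown above; the exact tag grammar the model accepts is an assumption.

```python
import re

# Minimal parser for inline audio tags of the form [key: value], as in
# "We won! [emotion: excited]". The tag grammar is an assumption based on
# the one example tag shown; only the general shape is illustrated.

TAG_RE = re.compile(r"\[(\w+)\s*:\s*([^\]]+)\]")

def parse_tagged_text(text):
    """Return (plain_text, tags), where each tag is (position, key, value)."""
    tags = []
    plain_parts = []
    last = 0
    for m in TAG_RE.finditer(text):
        plain_parts.append(text[last:m.start()])
        pos = sum(len(p) for p in plain_parts)  # tag position in plain text
        tags.append((pos, m.group(1), m.group(2).strip()))
        last = m.end()
    plain_parts.append(text[last:])
    return "".join(plain_parts), tags

plain, tags = parse_tagged_text(
    "We won! [emotion: excited][speed: fast] Let's celebrate."
)
# plain == "We won!  Let's celebrate."
# tags  == [(8, "emotion", "excited"), (8, "speed", "fast")]
```

Note that two adjacent tags resolve to the same position, which is how "combining multiple tags" at one point in the text would look to a downstream synthesizer.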
🎧 Open-Source ASR: "All-Round Ears" in Noisy Real Scenarios
The simultaneously open-sourced MiMo-V2.5-ASR focuses on "hearing clearly and accurately":
- Supports major Chinese dialects, including Wu, Cantonese, Minnan, and Sichuanese;
- Transcribes mixed-language (code-switching) speech fluently without a preset language;
- Stays highly robust in difficult conditions such as strong noise, far-field pickup, and overlapping speakers (e.g., meetings);
- Accurately recognizes classical poetry, technical terms, and song lyrics (even over background music);
- Outputs punctuation natively, so transcripts can feed downstream tasks directly with no post-processing.
In multiple authoritative benchmarks, the model delivers industry-leading results across general Chinese and English, dialects, code-switching, and lyric recognition.
🚀 Free Access + Open Source, Accelerating the Development of the Agent Ecosystem
At present, the three TTS models are free to use for a limited time on the Xiaomi MiMo Open Platform, where developers can try them via API calls or MiMo Studio. Meanwhile, the MiMo-V2.5-ASR model weights and code have been fully open-sourced and are available for community re-development.
