Recently, the next-generation Kaldi team (k2-fsa) of Xiaomi officially open-sourced OmniVoice, a super-large-scale multilingual zero-shot text-to-speech (TTS) model supporting over 600 languages. It achieves SOTA (State-of-the-Art) in multiple key metrics on Chinese, English, and multilingual benchmarks, bringing a new breakthrough to the field of speech synthesis.

Leading Performance: Chinese WER as Low as 0.84%, Surpassing Mainstream Commercial Models in Multilingual Scenarios

On the Seed-TTS Chinese test set, OmniVoice achieves a word error rate (WER) of just 0.84%. On multilingual benchmarks, it outperforms well-known models such as ElevenLabs v2 and MiniMax on both speaker similarity (SIM-o) and WER, demonstrating outstanding naturalness and intelligibility.


Ultra-fast Inference: RTF as Low as 0.025, 40 Times Faster than Real Time

OmniVoice's real-time factor (RTF) is as low as 0.025: it needs only 0.025 seconds of compute to produce one second of audio, about 40 times faster than real time. This allows the model to quickly generate speech for long texts in practical applications, greatly improving the user experience.
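The arithmetic behind the RTF claim can be shown with a minimal sketch. The 60-second clip and 1.5-second synthesis time below are illustrative numbers chosen to match an RTF of 0.025, not measurements from the model:

```python
# RTF (real-time factor) = synthesis time / audio duration.
# RTF < 1.0 means faster than real time; the speedup is 1 / RTF.

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# Illustrative example: 60 s of speech generated in 1.5 s of compute.
rtf = real_time_factor(1.5, 60.0)
print(rtf)        # 0.025
print(1.0 / rtf)  # ~40x faster than real time
```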

Core Architecture Innovation: Discrete Non-Autoregressive Design Inspired by Diffusion Language Models

OmniVoice adopts a discrete non-autoregressive architecture inspired by diffusion language models, generating speech directly from text in a single stage and skipping the intermediate semantic-token step used in traditional pipelines. This design significantly simplifies the pipeline while preserving speech quality. A full-codebook random masking strategy, combined with initialization from a pre-trained LLM, further improves training efficiency and the clarity and intelligibility of the final output.
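As a rough illustration of what full-codebook random masking could look like during training, here is a minimal sketch: it hides the same randomly chosen time steps across every acoustic codebook so the model must predict them. The MASK_ID value, codebook count, and per-time-step masking granularity are assumptions for illustration, not details from the OmniVoice release:

```python
import numpy as np

MASK_ID = 1024  # hypothetical id reserved for the [MASK] token, outside the codebook range


def full_codebook_random_mask(tokens: np.ndarray, mask_ratio: float, rng):
    """Mask the same randomly chosen time steps across all codebooks.

    tokens: (num_codebooks, seq_len) integer array of acoustic token ids.
    Returns (masked_tokens, mask), where mask marks positions to predict.
    """
    _, seq_len = tokens.shape
    mask = rng.random(seq_len) < mask_ratio  # one Bernoulli draw per time step
    masked = tokens.copy()
    masked[:, mask] = MASK_ID                # hide every codebook at those steps
    return masked, mask


rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(8, 50))  # toy data: 8 codebooks, 50 frames
masked, mask = full_codebook_random_mask(tokens, 0.5, rng)
```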

Flexible Voice Cloning and Customization: Achievable with 3-10 seconds of reference audio

The model supports high-quality zero-shot voice cloning using short reference audio of 3-10 seconds. Additionally, users can customize voice attributes through natural language descriptions, including gender, age, pitch, accent, dialect, and even special effects such as whispering.
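Before sending a reference clip to the model, an application might validate that it falls in the 3-10 second window mentioned above. This helper is a sketch of that check; the function name and sample rate are illustrative and not part of any OmniVoice API:

```python
def is_valid_reference(num_samples: int, sample_rate: int,
                       min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Return True if a reference clip's duration falls in the [min_s, max_s] window."""
    duration = num_samples / sample_rate
    return min_s <= duration <= max_s


print(is_valid_reference(16000 * 5, 16000))   # True: a 5 s clip at 16 kHz
print(is_valid_reference(16000 * 2, 16000))   # False: too short
```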

Support for Non-Linguistic Symbols and Fine-grained Pronunciation Control

OmniVoice can handle non-linguistic symbols, such as [laughter] representing laughter, and also supports pronunciation correction through pinyin or phonetic symbols, making it especially suitable for precise synthesis of Chinese and dialects.
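The bracketed-tag convention can be separated from readable text with a simple pattern. The snippet below only illustrates [laughter]-style tags as described above; the regex is an assumed convention, and the pinyin-override syntax is not shown because its exact format is not specified here:

```python
import re

# Non-linguistic events are written inline as bracketed tags, e.g. [laughter].
text = "That's hilarious [laughter] , I can't believe it happened."

# Extract all bracketed tags so downstream code can treat them as acoustic
# events rather than words to be read aloud.
tags = re.findall(r"\[([a-z]+)\]", text)
print(tags)  # ['laughter']
```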

Support for 600+ Languages: Aiding the Digital Preservation of Minority and Endangered Languages

OmniVoice's biggest highlight is its extensive language coverage, efficiently supporting both major languages and numerous low-resource ones. For minority and endangered languages, high-quality speech can be generated from just a few samples, which is of great significance for digitally preserving linguistic and cultural heritage.

OmniVoice's code and pre-trained models are now open-sourced on GitHub and Hugging Face, allowing developers to easily deploy them locally or integrate them into applications. AIbase will continue to follow community feedback and real-world use cases of OmniVoice. Developers are welcome to share more experiences.

Project Link: https://github.com/k2-fsa/OmniVoice