Recently, the next-generation Kaldi team (k2-fsa) of Xiaomi officially open-sourced OmniVoice, a super-large-scale multilingual zero-shot text-to-speech (TTS) model supporting over 600 languages. It achieves SOTA (State-of-the-Art) in multiple key metrics on Chinese, English, and multilingual benchmarks, bringing a new breakthrough to the field of speech synthesis.

Leading Performance: Chinese WER as Low as 0.84%, Surpassing Mainstream Commercial Models in Multilingual Scenarios

On the Seed-TTS Chinese test set, OmniVoice achieves a word error rate (WER) of just 0.84%. On multilingual benchmarks, it outperforms well-known models such as ElevenLabs v2 and MiniMax on both speaker similarity (SIM-o) and WER, demonstrating outstanding naturalness and intelligibility.


Ultra-fast Inference: RTF as Low as 0.025, 40 Times Faster than Real Time

OmniVoice's real-time factor (RTF) is as low as 0.025: it needs only 0.025 seconds of compute to produce one second of audio, about 40 times faster than real time. This allows the model to quickly generate speech for long texts in practical applications, greatly improving the user experience.
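The arithmetic behind the RTF claim can be shown with a minimal sketch. The 60-second clip and 1.5-second synthesis time below are illustrative numbers chosen to match an RTF of 0.025, not measurements from the model:

```python
# RTF (real-time factor) = synthesis time / audio duration.
# RTF < 1.0 means faster than real time; the speedup is 1 / RTF.

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    return synthesis_seconds / audio_seconds

# Illustrative example: 60 s of speech generated in 1.5 s of compute.
rtf = real_time_factor(1.5, 60.0)
print(rtf)        # 0.025
print(1.0 / rtf)  # ~40x faster than real time
```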

Core Architecture Innovation: Discrete Non-Autoregressive Design Inspired by Diffusion Language Models

OmniVoice adopts a discrete non-autoregressive architecture inspired by diffusion language models, generating speech directly from text in a single stage and skipping the intermediate semantic-token step used in traditional pipelines. This design significantly simplifies the pipeline while preserving speech quality. A full-codebook random masking strategy, combined with initialization from a pre-trained LLM, further improves training efficiency and the clarity and intelligibility of the final output.
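As a rough illustration of what full-codebook random masking could look like during training, here is a minimal sketch: it hides the same randomly chosen time steps across every acoustic codebook so the model must predict them. The MASK_ID value, codebook count, and per-time-step masking granularity are assumptions for illustration, not details from the OmniVoice release:

```python
import numpy as np

MASK_ID = 1024  # hypothetical id reserved for the [MASK] token, outside the codebook range


def full_codebook_random_mask(tokens: np.ndarray, mask_ratio: float, rng):
    """Mask the same randomly chosen time steps across all codebooks.

    tokens: (num_codebooks, seq_len) integer array of acoustic token ids.
    Returns (masked_tokens, mask), where mask marks positions to predict.
    """
    _, seq_len = tokens.shape
    mask = rng.random(seq_len) < mask_ratio  # one Bernoulli draw per time step
    masked = tokens.copy()
    masked[:, mask] = MASK_ID                # hide every codebook at those steps
    return masked, mask


rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(8, 50))  # toy data: 8 codebooks, 50 frames
masked, mask = full_codebook_random_mask(tokens, 0.5, rng)
```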

Flexible Voice Cloning and Customization: Achievable with 3-10 seconds of reference audio

The model supports high-quality zero-shot voice cloning using short reference audio of 3-10 seconds. Additionally, users can customize voice attributes through natural language descriptions, including gender, age, pitch, accent, dialect, and even special effects such as whispering.
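Before sending a reference clip to the model, an application might validate that it falls in the 3-10 second window mentioned above. This helper is a sketch of that check; the function name and sample rate are illustrative and not part of any OmniVoice API:

```python
def is_valid_reference(num_samples: int, sample_rate: int,
                       min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Return True if a reference clip's duration falls in the [min_s, max_s] window."""
    duration = num_samples / sample_rate
    return min_s <= duration <= max_s


print(is_valid_reference(16000 * 5, 16000))   # True: a 5 s clip at 16 kHz
print(is_valid_reference(16000 * 2, 16000))   # False: too short
```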

Support for Non-Linguistic Symbols and Fine-grained Pronunciation Control

OmniVoice can handle non-linguistic symbols, such as [laughter] representing laughter, and also supports pronunciation correction through pinyin or phonetic symbols, making it especially suitable for precise synthesis of Chinese and dialects.
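The bracketed-tag convention can be separated from readable text with a simple pattern. The snippet below only illustrates [laughter]-style tags as described above; the regex is an assumed convention, and the pinyin-override syntax is not shown because its exact format is not specified here:

```python
import re

# Non-linguistic events are written inline as bracketed tags, e.g. [laughter].
text = "That's hilarious [laughter] , I can't believe it happened."

# Extract all bracketed tags so downstream code can treat them as acoustic
# events rather than words to be read aloud.
tags = re.findall(r"\[([a-z]+)\]", text)
print(tags)  # ['laughter']
```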

Support for 600+ Languages: Aiding the Digital Preservation of Minority and Endangered Languages

OmniVoice's biggest highlight is its extensive language coverage, efficiently supporting both major languages and numerous low-resource ones. For minority and endangered languages, high-quality speech can be generated from just a few samples, which is of great significance for digitally preserving linguistic and cultural heritage.

OmniVoice's code and pre-trained models are now open-sourced on GitHub and Hugging Face, allowing developers to easily deploy them locally or integrate them into applications. AIbase will continue to follow community feedback and real-world use cases of OmniVoice. Developers are welcome to share more experiences.

Project Link: https://github.com/k2-fsa/OmniVoice