Step Audio Model Ranks Among the Global Top Three, Setting a New High for Chinese Large Models in Listening Quality

On the latest Artificial Analysis Speech Arena Leaderboard, a widely followed global ranking of TTS (Text-to-Speech) models, the Chinese company StepFun performed strongly. Its speech generation model, StepAudio2.5TTS, placed in the global top three on the strength of its listening quality, making it the highest-ranked Chinese large-model product on the list so far.

Unlike traditional laboratory metrics, this ranking uses a stricter "blind test Elo scoring mechanism." Users listen to two audio clips generated from the same text, without knowing which model produced which clip, and choose between them based on subjective listening experience. Test scenarios cover real-life situations such as online customer service, knowledge sharing, digital assistants, and entertainment interactions. StepFun's placement indicates that real users find its generated speech to have more of a "human touch," and that it is competitive with the top international models in tonal naturalness and expressiveness.
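To make the blind-test Elo idea concrete, here is a minimal sketch of how pairwise votes can be turned into ratings. It uses the standard Elo update formula; the starting rating of 1000, the K-factor of 32, and the model names are illustrative assumptions, since the article does not state the parameters or exact method the leaderboard uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind comparison.

    k=32 is an assumed K-factor, not a value taken from the leaderboard.
    """
    delta = k * ((1.0 if a_wins else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes, for illustration only.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y", True), ("model_x", "model_y", True),
         ("model_x", "model_y", False)]

for a, b, a_wins in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_wins)

print(ratings)
```

Because each update moves the two ratings by equal and opposite amounts, beating a higher-rated model earns more points than beating a lower-rated one, which is why a large number of anonymous votes can produce a stable ordering.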

StepFun has now released the full StepAudio2.5 model series, covering the entire voice pipeline: the TTS model for speech generation, the ASR model for high-precision recognition, and the newly launched Realtime model for real-time interaction. The Realtime model places particular emphasis on a "human-like feel," using strong paralinguistic capabilities and customizable character settings to give users a warm, expressive AI conversation partner.

The company has been building out its voice AI portfolio for some time. Its open-source native reasoning model, Step Audio R1.1, has held the top spot on another global speech reasoning ranking for four consecutive months, while another open-source model for editing emotion and style, Step Audio EditX, can produce a high-quality voice clone from just 3 seconds of reference audio.

Xiaomi Launches Full-Chain Speech Large Model MiMo-V2.5: TTS Can Generate New Voices from a Single Sentence, Open-Source ASR Supports Dialects and Mixed-Language Speech

Xiaomi has launched the MiMo-V2.5 full-chain voice model series, comprising three TTS models and one open-source ASR model that together cover voice input and output. The TTS models offer precise control over emotion, tone, and character identity, making voice programmable, creative, and replicable, and making human-machine interaction more natural.....

Xiaomi MiMo-V2.5 Opens Public Beta: A Hand-Written Compiler in 4.3 Hours, Long-Horizon Agent Capabilities Take a Major Leap

Xiaomi has released the MiMo-V2.5 series of large models, including MiMo-V2.5, V2.5-Pro, and accompanying TTS and ASR models, marking an upgrade from "usable" to "user-friendly." The flagship model, MiMo-V2.5-Pro, is reported to be competitive with top models such as Claude Opus4.6 and GPT-5.4 in general agent capabilities and software engineering. Its core strengths are strong instruction following and self-correction.

Xiaomi Open-Sources a Major Project! OmniVoice Covers 600+ Languages for Zero-Shot Voice Cloning TTS: WER of Only 0.84%, 40 Times Faster, Bringing Low-Resource Languages Within Easy Reach

Xiaomi's Kaldi team has open-sourced the OmniVoice model, which supports more than 600 languages. It achieves SOTA results on multiple metrics across Chinese and multilingual TTS benchmarks: the Chinese WER is as low as 0.84%, and its multilingual performance surpasses that of mainstream commercial models.
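For readers unfamiliar with the WER figure above, the sketch below shows how word error rate is conventionally computed: the token-level edit distance between a reference transcript and a hypothesis, divided by the reference length. For TTS, the hypothesis is typically obtained by running an ASR system on the synthesized audio. The function name and sample sentences are illustrative assumptions, and the article does not describe the exact evaluation setup behind the 0.84% figure (for Chinese, tokens are often characters rather than words).

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference tokens seen so far
    # and the first j hypothesis tokens (rolling-row Levenshtein).
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


# Illustrative sentences: one substitution and one deletion over six words.
ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
print(f"WER = {word_error_rate(ref, hyp):.2%}")
```

A WER of 0.84% therefore means fewer than one token error per hundred reference tokens under whatever tokenization and ASR system the benchmark uses.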

The Robot Can Now Speak! Zhiyuan Collaborates with MiniMax to Customize Personalized Human Models for Each Person

MiniMax has entered a strategic partnership with Zhiyuan Robot, providing full-process AI technology support to help embodied intelligence evolve from "core movement" to "emotional interaction." The partnership focuses on building a deeply customized interaction system for Zhiyuan Robot, including a personalized human model system, to strengthen the robot's capacity for emotional interaction.

Microsoft Launches VibeVoice-Realtime: A New Real-Time Text-to-Speech Model for Interactive Applications

Microsoft has launched VibeVoice-Realtime-0.5B, a lightweight real-time text-to-speech model that supports streaming input and long-form output for agent applications and live data narration. It begins producing speech in about 300 ms, can be paired with a language model to voice its responses, and is built on a next-token diffusion framework that operates on continuous speech tokens.....