In the latest Artificial Analysis Speech Arena Leaderboard, an authoritative global ranking of TTS (Text-to-Speech) models, Chinese company StepFun posted a strong result. Its speech generation model, StepAudio2.5TTS, ranked in the top three globally on the strength of its listening quality, making it the highest-ranked Chinese large model product on this list so far.

Unlike traditional laboratory metrics, this ranking uses a more rigorous "blind test Elo scoring mechanism." Users listen to two audio clips generated from the same text, without knowing which model produced each, and pick a preference based on subjective listening experience. Test scenarios cover real-life situations such as online customer service, knowledge sharing, digital assistants, and entertainment interactions. StepFun's result indicates that, in real user feedback, its generated speech has more "human touch," and that it is competitive at the top international level in tone naturalness and expressiveness.
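For readers unfamiliar with Elo scoring, the idea can be sketched in a few lines of Python. This is a minimal illustration of the textbook Elo update applied to pairwise blind votes, not the leaderboard's actual implementation: the K-factor of 32, the starting rating of 1000, the model names, and the votes are all assumptions made up for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one blind comparison.

    outcome_a is 1.0 if the listener preferred A, 0.0 if B, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value published by the leaderboard.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - exp_a)
    # B's change mirrors A's, so the total rating is conserved.
    new_b = rating_b - k * (outcome_a - exp_a)
    return new_a, new_b


# Hypothetical models and votes, for illustration only.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y", 1.0), ("model_x", "model_y", 1.0), ("model_x", "model_y", 0.0)]

for a, b, outcome in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)

print(ratings)
```

Because each vote only shifts ratings by the gap between the actual and the expected outcome, a win over a higher-rated model moves the score more than a win over a lower-rated one, which is why a large number of blind votes is needed before the ranking stabilizes.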


StepFun has now released the full StepAudio2.5 model lineup: the TTS model for speech generation, the ASR model for high-precision recognition, and the newly launched Realtime model for real-time interaction. The Realtime model places particular emphasis on a "human-like feel," using top-tier paralinguistic capabilities and customizable character settings to give users a warm, soulful AI conversation partner.

In fact, the company had already built up a presence in voice AI. Its open-source native reasoning model, Step Audio R1.1, has topped another global speech reasoning ranking for four consecutive months, while another open-source model, Step Audio EditX, which edits emotion and speaking style, can produce a high-quality voice clone from just 3 seconds of material, demonstrating high technical efficiency.