Editing audio like editing a Word document? StepXenon releases the 3 billion parameter audio editing model Step-Audio-EditX

The barrier to audio creation has been completely broken. Domestic AI unicorn StepStepFun AI officially launched the world's leading LLM-level audio editing model - Step-Audio-EditX on November 9th, achieving a revolutionary experience of "editing voice with natural language instructions" for the first time. Users just need to input "change this sentence into the arrogant tone of a Sichuan-Chongqing rapper" or "add a shy laugh at the end," and the model can accurately adjust the voice, emotion, rhythm, and even breathing pauses, making voice editing as intuitive and efficient as editing a document.

3 Billion Parameters, Performance Improves

The core breakthrough of Step-Audio-EditX lies in its ultra-efficient model compression technology. The team refined the original 13 Billion parameter model to just 3 Billion, not only significantly reducing deployment costs but also achieving a breakthrough in key indicators. The model supports zero-shot voice cloning - with just one reference audio, no training data from the target person is needed, and it can high-fidelity reproduce their voice; it also supports multi-round iterative editing, allowing users to issue detailed instructions continuously (such as "be a little gentler" or "extend the laughter by 0.3 seconds"), gradually approaching the desired effect.

Dialects and Emotions, Perfectly Mastered

The model's understanding of the Chinese context is particularly impressive, supporting Mandarin, English, Sichuan dialect, and Cantonese fluently. The regional emotions and pragmatic habits in dialect expressions are naturally and realistically presented. In blind tests, evaluators unanimously believed that its "Sichuan-Chongqing jokes' street sense" and "delicacy of Cantonese particles" far exceed similar products.

Challenging Closed-Source Commercial Models, Leading in Three Key Indicators

According to data obtained by AIbase, Step-Audio-EditX outperforms closed-source solutions such as Minimax and ByteDance Doubao in three core dimensions:

Naturalness score: 4.72/5 (Minimax 4.51, Doubao 4.38)

Emotion accuracy: 93.7% (leading the second place by 6.2 percentage points)

Voice consistency: 98.1%, almost lossless reproduction

Explosive Application Scenarios: From Short Videos to Accessibility Services

This technology is giving rise to new content forms:

Short video bloggers can switch between "cheerful girl" and "sarcastic instructor" voices with one click;

Audiobook creators can complete multiple character emotional dialogues alone;

Sichuan dialect comedy videos reworked by AI can instantly become American-style stand-up comedy for export;

Speech synthesis systems for hearing-impaired users now have "emotional warmth," no longer cold and mechanical.

AIbase believes that the significance of Step-Audio-EditX goes beyond tool upgrades - it is reshaping the logic of audio content production. When voice is no longer a linear medium that is fixed once recorded, but becomes a "living text" that can be repeatedly refined, millions of creators will gain unprecedented expressive freedom. Next, if StepStepFun opens an API or integrates into mobile systems, this "AI magic scissors" may truly enter everyone's pocket, allowing every voice to be reimagined.

Product Entry: https://stepaudiollm.github.io/step-audio-editx/

Editing audio like editing a Word document? StepXenon releases the 3 billion parameter audio editing model Step-Audio-EditX

Related Recommendations

Step-Audio-EditX Launch: 3 Billion Parameter Audio LLM Opens the Era of Voice Editing

StepFun AI Launches Open-Source Audio Editing Model Step-Audio-EditX for a New Audio Editing Experience

Generate a Professional PPT in One Sentence! Open Source Tool Presentation-AI Now Available for Free, Runs Locally, Outperforms Gamma.app

Step-Audio-EditX Launch: 3 Billion Parameter Audio LLM Opens the Era of Voice Editing

50 Million USD in Seed Funding! Stanford Professor Founded Inception to Challenge GPT-5 with a Diffusion-based Large Model, Code Generation Speed Exceeds 1000 Token/Second