Yesterday, Volcano Engine officially launched Doubao Audio Generation Model 1.0 (Doubao-Seed-Audio 1.0), which takes either text or audio as input and generates complete audio works end to end. The model's core breakthrough is that a single prompt covers dialogue, sound effects, and background music together, removing the traditional workflow of manual multi-track editing.

One sentence becomes an "audio director," with no post-production required
Previously, producing a high-quality audio piece meant generating dialogue, sound effects, and music one by one, aligning them by hand, and mixing multiple tracks, a complicated process that depended heavily on post-production skill. Doubao Audio Generation Model 1.0 compresses all of this into a single prompt: in one instruction, users can define multiple characters' lines, tone, and emotional rhythm, embed details such as laughter, sighs, pauses, and dialect accents, and have background music and ambient sound effects generated at the same time, so the output is a finished piece. A creator types a description and receives a podcast, audiobook, or brand audio clip that is ready for release.
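To make the "single prompt" idea concrete, here is a minimal sketch of how a creator might assemble one instruction from character lines, tone, embedded cues, background music, and ambient sound. The article does not specify the model's prompt syntax, so the `Line` class, the `build_prompt` function, and the bracketed cue format are all hypothetical illustrations, not the model's actual input format.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    """One spoken line. All field names here are hypothetical."""
    speaker: str                                    # character name
    text: str                                       # what the character says
    tone: str = ""                                  # e.g. "cheerful", "calm"
    cues: list[str] = field(default_factory=list)   # e.g. ["laughs", "sighs"]


def build_prompt(lines: list[Line], music: str = "", ambience: str = "") -> str:
    """Join music, ambience, and dialogue into one plain-text instruction.

    The layout is an assumption made for illustration only.
    """
    parts = []
    if music:
        parts.append(f"Background music: {music}")
    if ambience:
        parts.append(f"Ambient sound: {ambience}")
    for line in lines:
        tone = f" ({line.tone})" if line.tone else ""
        cues = "".join(f" [{cue}]" for cue in line.cues)
        parts.append(f"{line.speaker}{tone}:{cues} {line.text}")
    return "\n".join(parts)


prompt = build_prompt(
    [
        Line("Host", "Welcome back to the show.", tone="cheerful", cues=["laughs"]),
        Line("Guest", "Glad to be here.", tone="calm"),
    ],
    music="soft jazz, low volume",
    ambience="quiet cafe",
)
print(prompt)
```

The point of the sketch is only that dialogue, performance details, music, and ambience travel together in one piece of text rather than as separate tracks to be aligned afterwards.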
Long audio that doesn't "drift": consistent character voices from start to finish
The hardest problem in long-form audio is consistency: whether a character sounds the same in the tenth minute as in the first. Doubao Audio Generation Model 1.0 tightly integrates text-to-audio generation with reference audio, keeping voice quality consistent across long audio so creators do not need to compare and revise segment by segment. The model currently generates up to 2 minutes of audio at a time, and its extension feature keeps voice quality consistent over longer generation, meeting the needs of audiobooks, podcasts, and long-running series.
In addition, the model supports decoupled control of voice and style, so the same voice can adapt to different emotions and contexts and even achieve "one voice, multiple roles": the same voice delivers different expressions under different role settings, which significantly improves flexibility in character voice acting and creative audio production. Volcano Ark has now opened API testing, and individual users get a 30-minute creation quota in the Experience Center. Doubao Audio Generation Model 1.0 will also launch in products such as CapCut, Jimeng, and Tomato.
