Stability AI has officially released Stable Audio3, its latest-generation large audio model, and has open-sourced part of the model weights. A latent diffusion model built specifically for audio generation and editing, the system supports high-quality stereo output and delivers a major improvement in generation speed.

The new model family spans sizes from small to large, covering needs such as music creation and sound effect production. Notably, the models support variable-length audio generation and introduce an audio editing feature based on inpainting, which gives creators more flexibility to rework parts of an existing clip.

Innovative Architecture Breaks Hardware Limitations

Stable Audio3 consists of two core components: a semantic acoustic autoencoder called SAME and an efficient diffusion transformer. The SAME autoencoder compresses audio by a factor of up to 4096, which significantly shortens the latent sequence the model has to work with.
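
To get a feel for what a 4096-fold reduction means for sequence length, here is a rough back-of-the-envelope sketch. It assumes a 44.1 kHz sample rate and treats the 4096 figure as a pure reduction along the time axis; neither detail is stated in the announcement, so the numbers are illustrative only, and `latent_frames` is a hypothetical helper name.

```python
def latent_frames(duration_s: float, sample_rate: int = 44_100, compression: int = 4096) -> int:
    """Approximate latent sequence length for a clip, rounding up.

    Assumptions (not from the announcement): 44.1 kHz audio, and the
    4096x figure applied entirely along the time axis.
    """
    samples = round(duration_s * sample_rate)
    return -(-samples // compression)  # ceiling division


for seconds in (20, 380):
    print(f"{seconds:>3} s of audio -> about {latent_frames(seconds)} latent frames")
```

Under these assumptions, even a clip of more than six minutes maps to only a few thousand latent positions, which is the kind of sequence length a transformer can handle comfortably.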

Thanks to this efficient compression, the model can run long-duration, large-scale audio generation tasks smoothly even on ordinary consumer-grade hardware. This lowers the technical barrier to high-quality audio creation and puts professional-level audio and video production within reach of individual creators working at home.

Extreme Efficiency Enables Near-Instant Rendering

Because generation is variable-length, the model's computational cost scales with the audio duration the user requests, eliminating the compute wasted by fixed-length generation in earlier approaches. In tests on high-performance hardware, the model renders a 20-second audio clip in about 0.62 seconds and generates 380 seconds of music in just 1.31 seconds.
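
The two timings quoted above can be turned into a real-time factor, meaning seconds of audio produced per second of compute. The short sketch below does only that arithmetic; `real_time_factor` is a hypothetical helper name, and the inputs are the figures reported in the article rather than independent measurements.

```python
def real_time_factor(audio_s: float, render_s: float) -> float:
    """Seconds of audio produced per second of rendering time."""
    return audio_s / render_s


# (audio duration, render time) pairs as quoted in the article.
benchmarks = [(20.0, 0.62), (380.0, 1.31)]

for audio_s, render_s in benchmarks:
    rtf = real_time_factor(audio_s, render_s)
    print(f"{audio_s:.0f} s of audio in {render_s} s -> about {rtf:.0f}x real time")
```

That works out to roughly 32x real time for the short clip and roughly 290x for the long one. In these figures the audio is 19 times longer but the render time only about doubles.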

In addition, thanks to a three-stage training process, Stable Audio3 no longer relies on classifier-free guidance during inference, so generation needs only a fast single-step forward pass. The small and medium model weights are currently available on Hugging Face, while the larger, more capable version will be offered under a commercial license.
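
For context on why dropping classifier-free guidance saves time, the sketch below contrasts the standard guidance formula, which needs two model evaluations per step, with a guidance-free step that needs one. `CountingDenoiser` is a toy stand-in invented for illustration; it is not Stable Audio3 code and says nothing about how the three-stage training removes the need for guidance.

```python
from typing import List, Optional

Vector = List[float]


class CountingDenoiser:
    """Hypothetical toy model that counts how often it is evaluated."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector, prompt: Optional[str]) -> Vector:
        self.calls += 1
        shift = 0.0 if prompt is None else 1.0  # crude "conditioning" effect
        return [0.5 * v + shift for v in x]


def cfg_step(model: CountingDenoiser, x: Vector, prompt: str, scale: float) -> Vector:
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    uncond = model(x, None)    # forward pass 1: no prompt
    cond = model(x, prompt)    # forward pass 2: with prompt
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def guidance_free_step(model: CountingDenoiser, x: Vector, prompt: str) -> Vector:
    """A single conditional forward pass, with no guidance mixing."""
    return model(x, prompt)


guided = CountingDenoiser()
cfg_step(guided, [2.0], "rain on a tin roof", scale=3.0)

direct = CountingDenoiser()
guidance_free_step(direct, [2.0], "rain on a tin roof")

print(f"with guidance: {guided.calls} forward passes; without: {direct.calls}")
```

A model that produces good output without guidance avoids that second evaluation at every step, which is one reason guidance-free inference is faster.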