The StepFun AI team recently released Step-Audio-R1, a new audio large language model that makes effective use of additional computation during generation and reasoning, addressing the accuracy drop that current audio AI models suffer on long reasoning chains. The research team argues that this drop is not an inherent limitation of audio models, but a consequence of training them to reason over text.

Most current audio models rely mainly on text data during training, so their reasoning reads like analyzing a transcript rather than actually listening to the sound. The StepFun team calls this phenomenon "text-based reasoning." To address it, Step-Audio-R1 requires the model to ground its reasoning in audio evidence when generating answers. This is achieved through a training method called "modal reasoning distillation," which selects and refines reasoning paths tied to acoustic features.
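The article does not publish the distillation procedure itself, but the core idea it describes, keeping only reasoning traces that actually cite audio evidence, can be sketched as a filtering step. The helper functions passed in below (`generate_traces`, `mentions_audio_evidence`, `answer_is_correct`) are hypothetical placeholders, not the paper's API:

```python
# Minimal sketch of the idea behind "modal reasoning distillation":
# sample candidate reasoning traces for an audio question, then keep
# only those that are grounded in acoustic evidence and end in a
# correct answer, so they can be reused as training targets.

def distill_audio_grounded_traces(audio, question, reference_answer,
                                  generate_traces, mentions_audio_evidence,
                                  answer_is_correct, num_candidates=8):
    """Return (trace, answer) pairs that are both correct and audio-grounded."""
    kept = []
    for trace, answer in generate_traces(audio, question, n=num_candidates):
        # Discard traces that reason purely over the question text.
        if not mentions_audio_evidence(trace):
            continue
        # Discard traces whose final answer is wrong.
        if not answer_is_correct(answer, reference_answer):
            continue
        kept.append((trace, answer))
    return kept
```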
Architecturally, Step-Audio-R1 builds on the Qwen2 audio encoder, which processes the raw waveform; an audio adapter then downsamples the encoder output to a 12.5 Hz frame rate. A Qwen2.5 32B decoder consumes the audio features and generates text. When answering, the model always emits an explicit reasoning block inside dedicated tags, so the structure and content of the reasoning can be optimized without affecting task accuracy.
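A rough sketch of the adapter stage described above, assuming a 50 Hz encoder output downsampled by a factor of 4 to reach 12.5 Hz. The dimensions (1280 for the encoder, 5120 for the decoder) and the strided-convolution design are illustrative assumptions, not the released weights:

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Downsamples encoder frames (assumed 50 Hz) to 12.5 Hz by a factor of 4."""
    def __init__(self, enc_dim=1280, dec_dim=5120, stride=4):
        super().__init__()
        # A strided 1D convolution both reduces the frame rate and
        # projects encoder features into the decoder's hidden size.
        self.pool = nn.Conv1d(enc_dim, dec_dim, kernel_size=stride, stride=stride)

    def forward(self, frames):            # frames: (batch, time, enc_dim)
        x = frames.transpose(1, 2)        # -> (batch, enc_dim, time)
        x = self.pool(x)                  # -> (batch, dec_dim, time // 4)
        return x.transpose(1, 2)          # -> (batch, time // 4, dec_dim)

adapter = AudioAdapter()
frames = torch.randn(1, 200, 1280)        # 4 s of audio at a 50 Hz encoder rate
print(adapter(frames).shape)              # torch.Size([1, 50, 5120]) -> 12.5 Hz
```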
During training, the model went through a supervised cold-start phase followed by a reinforcement learning phase, both mixing text and audio tasks. In the cold-start phase, the team used 5 million samples, covering 100 million text tokens and 4 billion tokens of paired audio data. Here the model learned to generate reasoning useful for both audio and text, establishing basic reasoning capability.
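For orientation, the two-phase recipe and data mix reported above can be summarized as a configuration sketch. Only the phase structure and the quoted data volumes come from the article; objectives and reward details are placeholder assumptions:

```python
# Sketch of the training plan described in the article. Fields marked
# "not specified" are assumptions filled in for readability only.
training_plan = {
    "cold_start_sft": {
        "samples": 5_000_000,
        "text_tokens": 100_000_000,
        "paired_audio_tokens": 4_000_000_000,
        "objective": "supervised next-token prediction (not specified)",
    },
    "reinforcement_learning": {
        "tasks": ["audio", "text"],
        "reward": "answer correctness (not specified)",
    },
}
```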
Through multiple rounds of "modal reasoning distillation," the research team extracted reasoning grounded in real acoustic features from audio questions, then further optimized the model's reasoning ability with reinforcement learning. Step-Audio-R1 performs well across multiple audio understanding and reasoning benchmarks, with an overall score close to the industry-leading Gemini 3 Pro.
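The alternation implied here, repeated distillation rounds interleaved with RL refinement, can be expressed as a simple loop. The training steps are passed in as caller-supplied callables because the article specifies only the alternation, not the individual steps:

```python
# Hedged sketch of the iterative loop described above: each round first
# distills audio-grounded reasoning traces back into the model, then
# refines the model with a reinforcement learning step.

def iterative_training(model, dataset, distill_round, rl_round, rounds=3):
    """Alternate modal reasoning distillation with RL refinement.

    distill_round and rl_round are hypothetical training callables;
    only the alternation structure follows the article.
    """
    for _ in range(rounds):
        model = distill_round(model, dataset)  # distill grounded traces
        model = rl_round(model, dataset)       # refine reasoning via RL
    return model
```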
Paper: https://arxiv.org/pdf/2511.15848
Key Points:
🔊 Step-Audio-R1, developed by StepFun AI, tackles the accuracy decline in audio reasoning using a modal reasoning distillation method.
📈 The model is built on the Qwen2 audio encoder with a Qwen2.5 32B decoder, and clearly separates its thinking process from the final answer during reasoning, improving audio processing capability.
🏆 Step-Audio-R1 outperforms Gemini 2.5 Pro on multiple benchmarks and is comparable to Gemini 3 Pro.
