Alibaba's Tongyi Lab has officially open-sourced Fun-Audio-Chat-8B, a next-generation end-to-end speech interaction model. Built around ultra-low latency and natural, fluid voice interaction, it marks a new stage for open-source speech AI: it understands user speech in real time, has strong emotion-perception capabilities, and delivers performance comparable to closed-source heavyweights such as GPT-4o Audio and Gemini 2.5 Pro. AIbase's take: Fun-Audio-Chat is not just another chat tool, but a genuine "AI voice partner."


Users simply speak, and the model instantly understands, reasons, and responds naturally. Its end-to-end speech-to-speech (S2S) architecture eliminates the latency of the traditional ASR + LLM + TTS multi-module pipeline, bringing the interaction experience closer to human conversation.

Core technical highlights

Ultra-low latency, efficient design: an innovative dual-resolution architecture (a 5 Hz shared backbone plus a 25 Hz detail head) saves nearly 50% of GPU compute and significantly improves response speed, making the model well suited to real-time deployment.
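To make the dual-resolution idea concrete, here is a minimal, illustrative PyTorch sketch. Nothing in it reflects the released implementation: the layer types, sizes, and the simple repeat-based upsampling are assumptions. The point is only that a heavy backbone can run at a coarse 5 Hz frame rate while a small head refines its states at 25 Hz, so most of the compute is spent at one-fifth the sequence length.

```python
import torch
import torch.nn as nn

class DualResolutionS2S(nn.Module):
    """Illustrative dual-resolution speech model: a shared backbone runs
    at a coarse 5 Hz frame rate, and a lightweight head refines its
    states at 25 Hz (5x upsampling). All sizes are assumptions."""

    def __init__(self, dim=1024, head_dim=512, vocab=4096, upsample=5):
        super().__init__()
        self.upsample = upsample
        # Coarse 5 Hz backbone: does the heavy semantic reasoning.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True),
            num_layers=8,
        )
        # Cheap 25 Hz head: refines upsampled states into speech tokens.
        self.proj = nn.Linear(dim, head_dim)
        self.head = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(head_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.to_tokens = nn.Linear(head_dim, vocab)

    def forward(self, x):  # x: (batch, frames_at_5hz, dim)
        h = self.backbone(x)                       # 5 Hz semantic states
        h = h.repeat_interleave(self.upsample, 1)  # 5 Hz -> 25 Hz
        h = self.head(self.proj(h))                # fine acoustic detail
        return self.to_tokens(h)                   # 25 Hz speech-token logits

model = DualResolutionS2S()
logits = model(torch.randn(1, 10, 1024))  # 2 s of audio as 5 Hz frames
print(logits.shape)                       # torch.Size([1, 50, 4096])
```

With the 2-layer head doing the 25 Hz work, only the short 5 Hz sequence passes through the deep backbone, which is where the reported GPU savings would come from.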

Empathetic emotional understanding: the model perceives user emotions (such as happiness, fatigue, or anger) from cues like tone, speech rate, and pauses, even when they are not stated explicitly, and responds with warmth and empathy, making the interaction feel more human.

Powerful voice function calling: supports Voice Function Calling, letting users complete complex tasks with natural voice commands such as "play music for me" or "make a call," truly replacing touch with speech.
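Under the hood, voice function calling typically means the model emits a structured tool call that the host application parses and executes. Here is a minimal dispatch sketch, assuming a hypothetical JSON format and tool names for illustration (this is not Fun-Audio-Chat's actual interface):

```python
import json

# Hypothetical tool registry: a real deployment would wire these
# handlers to actual media and telephony APIs.
def play_music(query: str) -> str:
    return f"Playing: {query}"

def make_call(contact: str) -> str:
    return f"Calling {contact}..."

TOOLS = {"play_music": play_music, "make_call": make_call}

def dispatch(model_output: str) -> str:
    """Parse an (assumed) JSON tool call emitted by the model
    and route it to the matching handler."""
    call = json.loads(model_output)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return f"Unknown tool: {call['name']}"
    return fn(**call.get("arguments", {}))

# e.g. the spoken command "play music for me" might yield:
print(dispatch('{"name": "play_music", "arguments": {"query": "jazz"}}'))
```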


Leading performance: Fun-Audio-Chat-8B ranks first among similarly sized models on multiple authoritative international benchmarks, including OpenAudioBench, MMAU, Speech-ACEBench, and VStyle. Its overall capabilities surpass open-source rivals such as GLM4-Voice, Kimi-Audio, and Baichuan-Omni, and on some metrics it matches or exceeds top closed-source models.

Rich application capabilities:

- Real-time answering of voice questions (such as summarizing a piece of audio content);
- Accurate recognition of emotion, voice, and commands;
- Multilingual translation and role-playing;
- Emotionally varied voice output (such as gentle, serious, or cheerful);
- Scenarios such as emotional companionship, smart-device control, and voice customer service.

AIbase's view: this open-source release includes the complete 8B model weights, inference code, and function-call examples, greatly lowering the barrier to entry and accelerating the speech-AI ecosystem. Interested developers can head to GitHub, Hugging Face, or ModelScope right away to download and try it, and kick off their own "emotionally intelligent" voice-AI era.
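For example, one quick way to fetch the weights is huggingface_hub's snapshot_download. The repo id below is a placeholder assumed for illustration; check the project page for the actual one:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- consult the project page for the real one.
local_dir = snapshot_download("FunAudioLLM/Fun-Audio-Chat-8B")
print(f"Model downloaded to: {local_dir}")
```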

Project address: https://funaudiollm.github.io/funaudiochat/