Alibaba's Tongyi Lab has officially open-sourced Fun-Audio-Chat-8B, a next-generation end-to-end speech interaction model. Built around ultra-low latency and natural, fluid voice interaction, it marks a new stage for open-source speech AI: it understands user speech in real time, has strong emotion-perception capabilities, and delivers performance comparable to closed-source heavyweights such as GPT-4o Audio and Gemini 2.5 Pro. AIbase's take: Fun-Audio-Chat is not just another chat tool, but a genuine "AI voice partner."


Users simply speak, and the model instantly understands, reasons, and responds naturally. Its end-to-end speech-to-speech (S2S) architecture eliminates the latency of the traditional ASR + LLM + TTS multi-module pipeline, bringing the interaction experience closer to human conversation.

Core technical highlights

Ultra-low latency, efficient design: an innovative dual-resolution architecture (a 5 Hz shared backbone plus a 25 Hz detail head) saves nearly 50% of GPU compute and significantly improves response speed, making the model well suited to real-time deployment.
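To make the dual-resolution idea concrete, here is a minimal, illustrative PyTorch sketch. Nothing in it reflects the released implementation: the layer types, sizes, and the simple repeat-based upsampling are assumptions. The point is only that a heavy backbone can run at a coarse 5 Hz frame rate while a small head refines its states at 25 Hz, so most of the compute is spent at one-fifth the sequence length.

```python
import torch
import torch.nn as nn

class DualResolutionS2S(nn.Module):
    """Illustrative dual-resolution speech model: a shared backbone runs
    at a coarse 5 Hz frame rate, and a lightweight head refines its
    states at 25 Hz (5x upsampling). All sizes are assumptions."""

    def __init__(self, dim=1024, head_dim=512, vocab=4096, upsample=5):
        super().__init__()
        self.upsample = upsample
        # Coarse 5 Hz backbone: does the heavy semantic reasoning.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True),
            num_layers=8,
        )
        # Cheap 25 Hz head: refines upsampled states into speech tokens.
        self.proj = nn.Linear(dim, head_dim)
        self.head = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(head_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.to_tokens = nn.Linear(head_dim, vocab)

    def forward(self, x):  # x: (batch, frames_at_5hz, dim)
        h = self.backbone(x)                       # 5 Hz semantic states
        h = h.repeat_interleave(self.upsample, 1)  # 5 Hz -> 25 Hz
        h = self.head(self.proj(h))                # fine acoustic detail
        return self.to_tokens(h)                   # 25 Hz speech-token logits

model = DualResolutionS2S()
logits = model(torch.randn(1, 10, 1024))  # 2 s of audio as 5 Hz frames
print(logits.shape)                       # torch.Size([1, 50, 4096])
```

With the 2-layer head doing the 25 Hz work, only the short 5 Hz sequence passes through the deep backbone, which is where the reported GPU savings would come from.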

Empathetic emotional understanding: the model perceives user emotions (such as happiness, fatigue, or anger) from cues like tone, speech rate, and pauses, even when they are not stated explicitly, and responds with warmth and empathy, making the interaction feel more human.

Powerful voice function calling: supports Voice Function Calling, letting users complete complex tasks with natural voice commands such as "play music for me" or "make a call," truly replacing touch with speech.
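Under the hood, voice function calling typically means the model emits a structured tool call that the host application parses and executes. Here is a minimal dispatch sketch, assuming a hypothetical JSON format and tool names for illustration (this is not Fun-Audio-Chat's actual interface):

```python
import json

# Hypothetical tool registry: a real deployment would wire these
# handlers to actual media and telephony APIs.
def play_music(query: str) -> str:
    return f"Playing: {query}"

def make_call(contact: str) -> str:
    return f"Calling {contact}..."

TOOLS = {"play_music": play_music, "make_call": make_call}

def dispatch(model_output: str) -> str:
    """Parse an (assumed) JSON tool call emitted by the model
    and route it to the matching handler."""
    call = json.loads(model_output)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return f"Unknown tool: {call['name']}"
    return fn(**call.get("arguments", {}))

# e.g. the spoken command "play music for me" might yield:
print(dispatch('{"name": "play_music", "arguments": {"query": "jazz"}}'))
```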


Leading performance: Fun-Audio-Chat-8B ranks first among similarly sized models on multiple authoritative international benchmarks, including OpenAudioBench, MMAU, Speech-ACEBench, and VStyle. Its overall capabilities surpass open-source rivals such as GLM4-Voice, Kimi-Audio, and Baichuan-Omni, and on some metrics it matches or exceeds top closed-source models.

Rich application capabilities:

- Real-time answering of voice questions (such as summarizing a piece of audio content);
- Accurate recognition of emotion, voice, and commands;
- Multilingual translation and role-playing;
- Emotionally varied voice output (such as gentle, serious, or cheerful);
- Scenarios such as emotional companionship, smart-device control, and voice customer service.

AIbase's view: this open-source release includes the complete 8B model weights, inference code, and function-call examples, greatly lowering the barrier to entry and accelerating the speech-AI ecosystem. Interested developers can head to GitHub, Hugging Face, or ModelScope right away to download and try it, and kick off their own "emotionally intelligent" voice-AI era.
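For example, one quick way to fetch the weights is huggingface_hub's snapshot_download. The repo id below is a placeholder assumed for illustration; check the project page for the actual one:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- consult the project page for the real one.
local_dir = snapshot_download("FunAudioLLM/Fun-Audio-Chat-8B")
print(f"Model downloaded to: {local_dir}")
```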

Project address: https://funaudiollm.github.io/funaudiochat/