A research team from institutions including MIT CSAIL, the University of Göttingen, and IBM Research has recently proposed a new audio question-answering model called Omni-R1. Built on Qwen2.5-Omni, the model is fine-tuned with a reinforcement learning method named GRPO (Group Relative Policy Optimization) and performs strongly on audio question-answering tasks.

Omni-R1 set a new state of the art on the MMAU benchmark, which covers multiple audio categories including sound, speech, and music. The research team noted that although training involved audio data, the main reason for the improved performance was stronger text reasoning. The finding was surprising: even when fine-tuned on text data alone, the model improved significantly.
To build training data, the researchers used ChatGPT to generate a large amount of audio question-answering data, creating two new datasets: AVQA-GPT and VGGS-GPT. These datasets contain 40,000 and 182,000 audio samples respectively and further improved the accuracy of Omni-R1. Omni-R1 outperformed previous baseline models, including SARI, achieving an average score of 71.3%. The study shows that audio-based fine-tuning only slightly outperforms text-only fine-tuning, and that the text-only approach accounts for a significant share of the gain.
A key advantage of the GRPO method is its memory efficiency, which allows it to run effectively on GPUs with 48GB of memory. The method samples a group of outputs for each question, rewards each one according to whether its answer is correct, and compares the outputs within the group, so it needs no separately learned value function. The researchers also enriched the training data by expanding the audio descriptions produced by Qwen2-Audio, making the model more competitive in multimodal tasks.
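
The group comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' training code: the function names `correctness_rewards` and `group_relative_advantages` are hypothetical, the exact-match reward is an assumed stand-in for the paper's answer check, and the use of the population standard deviation with a small `eps` is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def correctness_rewards(predictions, reference):
    """Assumed reward: 1.0 if a sampled answer matches the reference, else 0.0."""
    target = reference.strip().lower()
    return [1.0 if p.strip().lower() == target else 0.0 for p in predictions]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    The group mean acts as the baseline, so no learned value function
    is needed to decide which samples were better than expected.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical sampled answers to one multiple-choice question.
    sampled = ["B", "A", "b", "C"]
    rewards = correctness_rewards(sampled, reference="B")
    print(rewards)
    print(group_relative_advantages(rewards))
```

Correct answers in a group receive positive advantages and incorrect ones negative advantages. If every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.
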
Omni-R1 not only sets a new state of the art in audio question-answering but also highlights the importance of text reasoning in improving audio model performance. The research team plans to release all related resources so that more researchers and developers can benefit from this work.
Paper: https://arxiv.org/abs/2505.09439
Key Points:
🔍 Omni-R1 is an audio question-answering model built on the Qwen2.5-Omni model, optimized using the GRPO reinforcement learning method.
📈 The model set a new state of the art on the MMAU benchmark, mainly because of improved text reasoning capabilities.
🛠️ The research team used ChatGPT to create two new datasets, AVQA-GPT and VGGS-GPT, which further improved the model's accuracy.
