SALMONN Framework: Expanding General Auditory Capabilities of Large Language Models

Published in Latest AI News

Time: Nov 29, 2023

Read: 1 minute

SALMONN is an audio-text multimodal large language model framework designed to extend the understanding and processing capabilities of large language models to the general auditory domain. It combines a BEATs audio encoder for non-speech audio, the speech encoder from OpenAI's Whisper model, and a window-level Q-Former, which gives it high temporal resolution for audio-text alignment. After the activation tuning stage, SALMONN achieves competitive performance on tasks such as audio captioning and speech translation, demonstrating general auditory capabilities.
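To make the window-level idea more concrete, here is a minimal sketch (not SALMONN's actual implementation). It assumes the two encoders emit frame features at the same rate, fuses them frame by frame, and cuts the fused sequence into fixed-size windows that each yield one output token, so the number of tokens grows with audio length. The function names, the window size, and the use of mean pooling in place of the Q-Former's learned attention are all illustrative assumptions.

```python
def split_into_windows(frames, window_size):
    """Cut a frame sequence into consecutive windows (the last may be shorter)."""
    return [frames[i:i + window_size] for i in range(0, len(frames), window_size)]


def mean_pool(window):
    """Stand-in for a Q-Former: average the frame vectors in one window.

    Assumption: a real Q-Former uses learned queries and cross-attention;
    mean pooling is used here only to keep the sketch dependency-free.
    """
    dim = len(window[0])
    return [sum(frame[d] for frame in window) / len(window) for d in range(dim)]


def window_level_tokens(speech_frames, audio_frames, window_size):
    """Fuse two frame-aligned feature streams and emit one token per window."""
    if len(speech_frames) != len(audio_frames):
        raise ValueError("both encoders must produce the same number of frames")
    # Frame-by-frame fusion: concatenate the two feature vectors of each frame.
    fused = [s + a for s, a in zip(speech_frames, audio_frames)]
    return [mean_pool(w) for w in split_into_windows(fused, window_size)]


# Hypothetical toy input: 10 frames, 1-dim "speech" and 2-dim "audio" features.
speech = [[float(i)] for i in range(10)]
audio = [[1.0, 0.0] for _ in range(10)]
tokens = window_level_tokens(speech, audio, window_size=4)
print(len(tokens), tokens[0])
```

Because each window is summarized separately, a longer clip produces more tokens rather than being squeezed into a fixed-length summary, which is the property the temporal-resolution claim rests on.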
