Amazon Launches New ASR System Supporting Over 100 Languages


At its 2025 1024 Developer Festival, iFlytek launched an integrated hardware-software AI solution. By co-designing algorithms and hardware, the solution tackles recognition challenges in complex environments such as high-noise and far-field conditions, improving the accuracy of both voice and visual intelligence and marking a notable advance in the field.
Recently, Alibaba's Tongyi Lab officially released its latest end-to-end speech recognition large model, FunAudio-ASR. The model's biggest highlight is its innovative "Context Module," which significantly improves recognition accuracy in high-noise environments: the hallucination rate drops from 78.5% to 10.7%, a reduction of nearly 70 percentage points. This breakthrough sets a new benchmark for the speech recognition industry, especially for noisy settings such as meetings and public places.
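The Context Module itself is not public, but the general idea behind contextual biasing can be shown with a toy rescoring step: recognition hypotheses that match caller-supplied context phrases (meeting agendas, participant names, product terms) receive a score boost. The sketch below is a minimal illustration under that assumption, not FunAudio-ASR's actual mechanism.

```python
# Toy illustration of contextual biasing (NOT FunAudio-ASR's Context Module):
# boost ASR hypotheses that contain caller-supplied context phrases.

def rescore_with_context(nbest, context_phrases, boost=2.0):
    """nbest: list of (text, log_prob) ASR hypotheses.
    Returns the best hypothesis after boosting matches with context phrases."""
    rescored = []
    for text, score in nbest:
        bonus = sum(boost for phrase in context_phrases if phrase in text)
        rescored.append((text, score + bonus))
    return max(rescored, key=lambda item: item[1])[0]

# Example: meeting audio where the product name "FunAudio" is known context.
nbest = [("fun audio launches today", -4.2), ("FunAudio launches today", -4.5)]
print(rescore_with_context(nbest, ["FunAudio"]))  # -> "FunAudio launches today"
```

In a real system the biasing happens inside the decoder rather than as post-hoc rescoring, but the effect is the same: supplying context steers the model toward domain terms it would otherwise mis-hear in noise.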
Recently, the OpenAI Evals tool received a significant update that adds native audio input and evaluation capabilities. Developers can now evaluate speech recognition and generation models directly with audio files, without the cumbersome detour through text transcription. This greatly simplifies the evaluation process and makes building audio applications more efficient. Previously, developers often had to convert audio content into text first, a step that was time-consuming and labor-intensive and could also compromise the fidelity of the evaluation.
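As a rough illustration of what "audio in, no transcription step" looks like at the API level, the sketch below sends an audio file directly to an audio-capable chat model via the OpenAI Python SDK. The model name and file path are assumptions chosen for the example; an actual eval run would wrap a call like this with a grader that scores the returned transcript.

```python
# Hedged sketch: pass an audio file directly to an audio-capable model with
# the OpenAI Python SDK, the kind of call an audio-native eval can now score
# without a separate transcription pass. Model name and path are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("sample_utterance.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",          # assumed audio-capable model
    modalities=["text"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio verbatim."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)  # candidate transcript to grade
```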
Recently, significant progress has been made in multimodal large language models (MLLMs), particularly in integrating the visual and text modalities. However, as human-computer interaction becomes more prevalent, the speech modality has grown in importance, especially for multimodal dialogue systems. Speech is not only a key medium for conveying information; it also makes interactions markedly more natural and convenient. Nevertheless, the inherent differences between visual and speech data make integrating them into MLLMs nontrivial: visual data encodes spatial structure, while speech conveys information as a temporal sequence.
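One common way to bridge that gap is to give each modality its own lightweight adapter that projects its features into the LLM's embedding space, with speech typically downsampled along the time axis because audio frames are far denser than image patches. The sketch below is purely illustrative; the dimensions and layer choices are assumptions, not any specific model's design.

```python
# Illustrative sketch (not a specific published architecture): map spatial
# vision features and temporal speech features into a shared LLM token space.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Projects patch features [B, num_patches, vision_dim] to the LLM dim."""
    def __init__(self, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patches):
        return self.proj(patches)

class SpeechAdapter(nn.Module):
    """Downsamples a long temporal sequence [B, frames, speech_dim] before
    projecting, since speech frames are much denser than image patches."""
    def __init__(self, speech_dim=512, llm_dim=2048, stride=4):
        super().__init__()
        self.downsample = nn.Conv1d(speech_dim, speech_dim,
                                    kernel_size=stride, stride=stride)
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, frames):
        x = self.downsample(frames.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)

# Example shapes: 576 image patches vs. 1,000 audio frames for ~10 s of speech.
vision_tokens = VisionAdapter()(torch.randn(1, 576, 1024))   # [1, 576, 2048]
speech_tokens = SpeechAdapter()(torch.randn(1, 1000, 512))   # [1, 250, 2048]
print(vision_tokens.shape, speech_tokens.shape)
```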
Nexa AI recently unveiled its new OmniAudio-2.6B audio language model, designed for efficient deployment on edge devices. Unlike traditional architectures that keep automatic speech recognition (ASR) and language models separate, OmniAudio-2.6B integrates Gemma-2-2b, Whisper Turbo, and a custom projector into a unified framework. This design eliminates the inefficiencies and latency of chaining separate components in traditional pipelines, making it especially suitable for resource-constrained computing.
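A unified design of this kind typically runs the audio encoder, a projector, and the language model in a single forward pass, so no intermediate transcript is produced along the way. The sketch below is a loose, illustrative stand-in for that wiring; the GRU and tiny Transformer are placeholders for Whisper Turbo and Gemma-2-2b, and the dimensions are assumptions rather than Nexa AI's actual specification.

```python
# Illustrative stand-in for an end-to-end audio language model:
# audio encoder -> projector -> LLM, with no ASR text hop in between.
import torch
import torch.nn as nn

class AudioLanguageModel(nn.Module):
    def __init__(self, audio_dim=1280, llm_dim=2304):
        super().__init__()
        self.audio_encoder = nn.GRU(80, audio_dim, batch_first=True)   # placeholder for Whisper Turbo
        self.projector = nn.Sequential(                                # custom projector
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = nn.TransformerEncoder(                              # placeholder for Gemma-2-2b
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, mel_features):
        audio_states, _ = self.audio_encoder(mel_features)   # temporal audio features
        audio_tokens = self.projector(audio_states)          # mapped into the LLM's space
        return self.llm(audio_tokens)                        # one pass, no transcript step

model = AudioLanguageModel()
out = model(torch.randn(1, 300, 80))  # ~3 s of 80-bin mel frames (illustrative)
print(out.shape)                      # torch.Size([1, 300, 2304])
```

The practical payoff of collapsing the pipeline is that audio features flow straight into the language model, which removes the latency and error accumulation of handing a transcript from one system to another on a constrained device.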