Amazon Launches New ASR System Supporting Over 100 Languages


At its 2025 1024 Developer Festival, iFlytek launched an integrated hardware-software AI solution. By co-designing algorithms and hardware, the solution tackles recognition challenges in complex environments such as high-noise and far-field conditions, improving the accuracy of both voice and visual intelligence and marking a notable advance in the field.
Recently, Alibaba's Tongyi Lab officially released its latest end-to-end speech recognition large model, FunAudio-ASR. The model's biggest highlight is its innovative "Context Module," which significantly improves recognition accuracy in high-noise environments: the hallucination rate drops from 78.5% to 10.7%, a reduction of nearly 70 percentage points. This breakthrough sets a new benchmark for the speech recognition industry, especially for noisy settings such as meetings and public places.
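The Context Module itself is not public, but the general idea behind contextual biasing can be shown with a toy rescoring step: recognition hypotheses that match caller-supplied context phrases (meeting agendas, participant names, product terms) receive a score boost. The sketch below is a minimal illustration under that assumption, not FunAudio-ASR's actual mechanism.

```python
# Toy illustration of contextual biasing (NOT FunAudio-ASR's Context Module):
# boost ASR hypotheses that contain caller-supplied context phrases.

def rescore_with_context(nbest, context_phrases, boost=2.0):
    """nbest: list of (text, log_prob) ASR hypotheses.
    Returns the best hypothesis after boosting matches with context phrases."""
    rescored = []
    for text, score in nbest:
        bonus = sum(boost for phrase in context_phrases if phrase in text)
        rescored.append((text, score + bonus))
    return max(rescored, key=lambda item: item[1])[0]

# Example: meeting audio where the product name "FunAudio" is known context.
nbest = [("fun audio launches today", -4.2), ("FunAudio launches today", -4.5)]
print(rescore_with_context(nbest, ["FunAudio"]))  # -> "FunAudio launches today"
```

In a real system the biasing happens inside the decoder rather than as post-hoc rescoring, but the effect is the same: supplying context steers the model toward domain terms it would otherwise mis-hear in noise.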
Recently, the OpenAI Evals tool received a significant update that adds native audio input and evaluation capabilities. Developers can now evaluate speech recognition and generation models directly with audio files, without the cumbersome detour through text transcription. This greatly simplifies the evaluation process and makes building audio applications more efficient. Previously, developers often had to convert audio content into text first, a step that was time-consuming and labor-intensive and could also compromise the fidelity of the evaluation.
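As a rough illustration of what "audio in, no transcription step" looks like at the API level, the sketch below sends an audio file directly to an audio-capable chat model via the OpenAI Python SDK. The model name and file path are assumptions chosen for the example; an actual eval run would wrap a call like this with a grader that scores the returned transcript.

```python
# Hedged sketch: pass an audio file directly to an audio-capable model with
# the OpenAI Python SDK, the kind of call an audio-native eval can now score
# without a separate transcription pass. Model name and path are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("sample_utterance.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",          # assumed audio-capable model
    modalities=["text"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio verbatim."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)  # candidate transcript to grade
```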
Recently, significant progress has been made in multimodal large language models (MLLMs), particularly in integrating the visual and text modalities. However, as human-computer interaction becomes more prevalent, the speech modality has grown in importance, especially for multimodal dialogue systems. Speech is not only a key medium for conveying information; it also makes interactions markedly more natural and convenient. Nevertheless, the inherent differences between visual and speech data make integrating them into MLLMs nontrivial: visual data encodes spatial structure, while speech conveys information as a temporal sequence.
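One common way to bridge that gap is to give each modality its own lightweight adapter that projects its features into the LLM's embedding space, with speech typically downsampled along the time axis because audio frames are far denser than image patches. The sketch below is purely illustrative; the dimensions and layer choices are assumptions, not any specific model's design.

```python
# Illustrative sketch (not a specific published architecture): map spatial
# vision features and temporal speech features into a shared LLM token space.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Projects patch features [B, num_patches, vision_dim] to the LLM dim."""
    def __init__(self, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patches):
        return self.proj(patches)

class SpeechAdapter(nn.Module):
    """Downsamples a long temporal sequence [B, frames, speech_dim] before
    projecting, since speech frames are much denser than image patches."""
    def __init__(self, speech_dim=512, llm_dim=2048, stride=4):
        super().__init__()
        self.downsample = nn.Conv1d(speech_dim, speech_dim,
                                    kernel_size=stride, stride=stride)
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, frames):
        x = self.downsample(frames.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)

# Example shapes: 576 image patches vs. 1,000 audio frames for ~10 s of speech.
vision_tokens = VisionAdapter()(torch.randn(1, 576, 1024))   # [1, 576, 2048]
speech_tokens = SpeechAdapter()(torch.randn(1, 1000, 512))   # [1, 250, 2048]
print(vision_tokens.shape, speech_tokens.shape)
```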
Nexa AI recently unveiled its new OmniAudio-2.6B audio language model, designed for efficient deployment on edge devices. Unlike traditional architectures that keep automatic speech recognition (ASR) and language models separate, OmniAudio-2.6B integrates Gemma-2-2b, Whisper Turbo, and a custom projector into a unified framework. This design eliminates the inefficiencies and latency of chaining separate components in traditional pipelines, making it especially suitable for resource-constrained computing.
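A unified design of this kind typically runs the audio encoder, a projector, and the language model in a single forward pass, so no intermediate transcript is produced along the way. The sketch below is a loose, illustrative stand-in for that wiring; the GRU and tiny Transformer are placeholders for Whisper Turbo and Gemma-2-2b, and the dimensions are assumptions rather than Nexa AI's actual specification.

```python
# Illustrative stand-in for an end-to-end audio language model:
# audio encoder -> projector -> LLM, with no ASR text hop in between.
import torch
import torch.nn as nn

class AudioLanguageModel(nn.Module):
    def __init__(self, audio_dim=1280, llm_dim=2304):
        super().__init__()
        self.audio_encoder = nn.GRU(80, audio_dim, batch_first=True)   # placeholder for Whisper Turbo
        self.projector = nn.Sequential(                                # custom projector
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = nn.TransformerEncoder(                              # placeholder for Gemma-2-2b
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, mel_features):
        audio_states, _ = self.audio_encoder(mel_features)   # temporal audio features
        audio_tokens = self.projector(audio_states)          # mapped into the LLM's space
        return self.llm(audio_tokens)                        # one pass, no transcript step

model = AudioLanguageModel()
out = model(torch.randn(1, 300, 80))  # ~3 s of 80-bin mel frames (illustrative)
print(out.shape)                      # torch.Size([1, 300, 2304])
```

The practical payoff of collapsing the pipeline is that audio features flow straight into the language model, which removes the latency and error accumulation of handing a transcript from one system to another on a constrained device.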