French AI leader Mistral AI has officially launched two new speech-to-text models, aiming to redefine industry standards for transcription speed, privacy protection, and cost-effectiveness.
The two new models, Voxtral Mini Transcribe V2 and Voxtral Realtime, are both part of the Voxtral Transcribe2 system. They offer top-tier transcription quality, speaker diarization (identifying who spoke when), and very low latency, making them suitable for business scenarios such as virtual assistants, call center automation, and compliance recording.

Key Product Highlights:
Voxtral Realtime (Real-time Processing): Designed specifically for live audio, it uses a new streaming architecture. Latency is configurable down to 200 milliseconds. At a 480-millisecond delay, the error rate is only 1% to 2%, almost matching the accuracy of offline transcription. With only 4 billion parameters, the model can run on local devices such as smartphones or laptops, so audio does not have to leave the device, which helps protect privacy. It is available on the Hugging Face platform under the Apache 2.0 license, and its API price is $0.006 per minute.
Voxtral Mini Transcribe V2 (Batch Processing): Designed specifically for pre-recorded files. It accepts up to 3 hours of audio in a single request and provides accurate speaker labels and timestamps. It performs well on the FLEURS word error rate benchmark, and its API price is only $0.003 per minute, which Mistral AI calls the most cost-effective transcription solution currently on the market.
Both models natively support 13 languages, including Chinese, English, French, and Japanese. Users can currently experience them on Mistral AI's Audio Playground or Le Chat assistant.
Key Points:
🚀 Outstanding Performance: The real-time model's latency can be set as low as 200 ms, while the batch model performs well on the FLEURS word error rate (WER) benchmark.
🔒 Local Deployment: At 4B parameters, the real-time model is light enough to run on local devices without uploading audio to the cloud, which helps protect privacy.
💰 High Cost-Effectiveness: The batch transcription API costs as little as $0.003 per minute, as Mistral AI seeks a pricing advantage in the enterprise market.
🌍 Multi-language Support: Natively supports 13 major languages worldwide, covering most commercial application scenarios.
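
To put the quoted prices in perspective, here is a minimal sketch of the arithmetic. It assumes billing is strictly linear per audio minute, with no rounding, minimum charge, or volume discount, and that a recording longer than the 3-hour limit would be split across several batch requests. The article confirms neither assumption, and the function names are illustrative rather than part of any Mistral AI SDK.

```python
import math

# Prices per audio minute as quoted in the article (USD).
PRICE_PER_MINUTE = {"batch": 0.003, "realtime": 0.006}

# Maximum audio length per batch request quoted in the article (3 hours).
MAX_BATCH_REQUEST_MINUTES = 3 * 60


def estimate_cost(minutes: float, mode: str) -> float:
    """Estimate API cost in USD, assuming simple linear per-minute billing."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    if mode not in PRICE_PER_MINUTE:
        raise ValueError(f"unknown mode: {mode!r}")
    return round(minutes * PRICE_PER_MINUTE[mode], 4)


def batch_requests_needed(minutes: float) -> int:
    """Number of batch requests if audio is split at the 3-hour limit (assumption)."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return math.ceil(minutes / MAX_BATCH_REQUEST_MINUTES)


if __name__ == "__main__":
    # A full 3-hour recording sent to the batch model.
    print(estimate_cost(180, "batch"))
    # One hour of live audio through the real-time model.
    print(estimate_cost(60, "realtime"))
    # A 7-hour archive would need three batch requests under the split assumption.
    print(batch_requests_needed(7 * 60))
```

Under these assumptions, a maximum-length 3-hour batch request comes to roughly $0.54, and an hour of real-time transcription to roughly $0.36.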
