French artificial intelligence startup Mistral AI recently announced Voxtral Transcribe 2, a new series of speech-to-text models. The series includes two models optimized for different application scenarios and aims to address the high latency and high cost of voice interaction.

The more attention-grabbing of the two is the real-time transcription model, Voxtral Realtime. It has 4 billion (4B) parameters and uses a streaming architecture that transcribes audio input as it arrives. According to Mistral's official figures, transcription latency has been reduced to below 200 ms (0.2 seconds), so in real-time conversation or simultaneous interpretation scenarios users perceive almost no processing pause. To support the developer community, Mistral AI has released the model weights under the Apache 2.0 license.

The other model, Voxtral Mini Transcribe V2, targets large-scale processing and cost-effectiveness. It is designed for long audio and accepts recording files of up to 3 hours in a single request. On accuracy, Mistral states that the model surpasses GPT-4o mini Transcribe and Gemini 2.5 Flash.
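For recordings longer than the 3-hour per-request limit, the audio would have to be split across several requests. The following is a minimal sketch of how that split could be planned. The helper name `plan_segments` is hypothetical, it is not part of any Mistral SDK, and it only computes time offsets rather than calling an API.

```python
# Per-request limit cited in the article: 3 hours, expressed in seconds.
MAX_REQUEST_SECONDS = 3 * 60 * 60


def plan_segments(total_seconds, max_seconds=MAX_REQUEST_SECONDS):
    """Return (start, end) offsets in seconds that cover the whole recording.

    Hypothetical helper for illustration only: every segment is at most
    max_seconds long, and segments are contiguous and non-overlapping.
    """
    if total_seconds < 0 or max_seconds <= 0:
        raise ValueError("durations must be non-negative and the limit positive")
    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    # A 5-hour recording needs two requests: one of 3 hours, one of 2 hours.
    print(plan_segments(5 * 60 * 60))
```

Cutting at fixed offsets can split a word in half, so a real pipeline would more likely cut at a pause near each boundary.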

Both new models support 13 major languages, including Chinese. Pricing is also competitive: the offline batch processing API costs $0.003 per minute, while the real-time API, which is tuned for the lowest latency, costs $0.006 per minute.
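To put the per-minute prices in concrete terms, here is a small sketch that estimates the cost of a job from its audio duration. It assumes billing is strictly linear in minutes with no rounding, minimum charge, or volume discount, none of which the article specifies. The function name `transcription_cost` is illustrative.

```python
from decimal import Decimal

# Per-minute prices quoted in the article, in USD.
BATCH_PRICE_PER_MIN = Decimal("0.003")
REALTIME_PRICE_PER_MIN = Decimal("0.006")


def transcription_cost(minutes, realtime=False):
    """Estimate the cost in USD of transcribing `minutes` of audio.

    Assumption: cost scales linearly with duration. Actual billing
    granularity is not stated in the article.
    """
    rate = REALTIME_PRICE_PER_MIN if realtime else BATCH_PRICE_PER_MIN
    return Decimal(str(minutes)) * rate


if __name__ == "__main__":
    # A maximum-length 3-hour (180-minute) recording.
    print(transcription_cost(180))                 # 0.540 (batch)
    print(transcription_cost(180, realtime=True))  # 1.080 (real-time)
```

Under these assumptions, a full 3-hour recording costs about $0.54 through the batch API and about $1.08 through the real-time API.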

Key Points:

  • Extremely Low Latency: The Voxtral Realtime model reduces transcription latency to under 200 ms, supports instant audio transcription, and has openly released model weights.

  • 🏆 High Cost-Effectiveness: According to Mistral, Voxtral Mini Transcribe V2 outperforms comparable products such as GPT-4o mini Transcribe in accuracy, supports recordings of up to 3 hours, and is competitively priced.

  • 🌐 Multi-language Support: The entire series natively supports 13 languages, including Chinese, making it suitable for voice-driven office work and real-time interaction in global settings.