On July 3, the French AI research institute Kyutai Labs announced the open-sourcing of its latest text-to-speech (TTS) technology, Kyutai TTS, offering developers and AI enthusiasts an efficient, real-time speech generation solution. Kyutai TTS stands out for its low latency and high-fidelity voice quality, and it supports streaming text input: audio generation can begin before the full text is available, making it particularly well suited to real-time interactive scenarios.

Kyutai TTS delivers strong performance. On a single NVIDIA L40S GPU, the model can serve 32 concurrent requests with a latency of only 350 milliseconds. Beyond generating high-quality audio, the system also outputs precise word-level timestamps, which is useful for real-time subtitle generation and for interactive applications such as the interruption-handling feature on the Unmute platform.
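To illustrate how word-level timestamps enable live subtitling, here is a minimal sketch that groups per-word timings into subtitle cues. The input format (a list of `(word, start, end)` triples) is an assumption for illustration, not the library's actual output schema.

```python
# Hypothetical sketch: turning per-word timestamps, like those a TTS
# system emits alongside audio, into simple subtitle cues. The triple
# format used here is an assumed example, not Kyutai's real schema.

def words_to_cues(words, max_gap=0.5, max_words=6):
    """Group (word, start, end) triples into subtitle cues.

    A new cue starts when the pause before a word exceeds `max_gap`
    seconds or the current cue already holds `max_words` words.
    """
    cues = []
    current = []
    for word, start, end in words:
        if current and (start - current[-1][2] > max_gap
                        or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append((word, start, end))
    if current:
        cues.append(current)
    # Each cue becomes (cue_start, cue_end, joined text).
    return [
        (cue[0][1], cue[-1][2], " ".join(w for w, _, _ in cue))
        for cue in cues
    ]

timestamps = [
    ("Hello", 0.00, 0.35), ("world", 0.40, 0.80),
    ("this", 1.60, 1.85), ("is", 1.90, 2.00), ("streaming", 2.05, 2.60),
]
for start, end, text in words_to_cues(timestamps):
    print(f"[{start:.2f}-{end:.2f}] {text}")
# → [0.00-0.80] Hello world
# → [1.60-2.60] this is streaming
```

Because the timestamps arrive incrementally with the audio, cues like these can be rendered in sync with playback, which is what makes interruption handling and live captions possible.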

In terms of language support and quality, Kyutai TTS currently covers English and French, with word error rates (WER) of 2.82 and 3.29 respectively, indicating high intelligibility. Speaker similarity reaches 77.1% for English and 78.7% for French, so generated voices stay natural and close to the reference sample. The model can also handle long-form text, going beyond the 30-second limit typical of traditional TTS systems, which makes it suitable for generating long content such as news articles and audiobooks.

Kyutai TTS is built on a delayed streams modeling (DSM) architecture, paired with a Rust server for efficient batched inference. The source code and model weights are available on GitHub and Hugging Face, helping developers worldwide advance innovation in speech technology.
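The core idea behind delayed streams, reduced to a toy scheduler: the audio stream runs a fixed number of steps behind the text stream, so synthesis can start while text is still arriving. This sketch shows only the scheduling principle with placeholder tokens; it is not Kyutai's actual model, token format, or API.

```python
# Conceptual toy sketch of the delayed-streams idea: audio output lags
# the text input by a fixed `delay`, so generation overlaps with text
# arrival instead of waiting for the full input. All names here are
# illustrative assumptions, not Kyutai's real interfaces.

def delayed_schedule(text_tokens, delay=2):
    """Return (step, text_in, audio_out) tuples for a toy delayed stream.

    At each step one text token is consumed (while any remain), and once
    `delay` steps have passed, one audio frame is emitted for the text
    token seen `delay` steps earlier.
    """
    steps = []
    total = len(text_tokens) + delay  # drain the tail of the audio stream
    for step in range(total):
        text_in = text_tokens[step] if step < len(text_tokens) else None
        audio_out = (f"audio({text_tokens[step - delay]})"
                     if step >= delay else None)
        steps.append((step, text_in, audio_out))
    return steps

for step, t, a in delayed_schedule(["Bon", "jour", "le", "monde"], delay=2):
    print(step, t, a)
```

With a small delay, the first audio frames appear only a couple of steps after the first text tokens, which is the property that lets a streaming TTS keep latency low regardless of total input length.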