NetEase Youdao Launches Open-Source Speech Synthesis Engine 'Yimosheng' Supporting Over 2,000 Voices


The Zhipu team has open-sourced four core technologies - the GLM-4.6V visual understanding, AutoGLM device control, GLM-ASR speech recognition, and GLM-TTS speech synthesis models - showcasing its latest progress in the multimodal field.
Amid the rapid development of speech synthesis technology, Mianbi Intelligence and the Human-Computer Speech Interaction Laboratory of Tsinghua University's Shenzhen International Graduate School (THUHCSI) recently released a new speech generation model, VoxCPM. With a parameter size of 0.5B, the model aims to deliver high-quality, natural-sounding speech synthesis. Its release marks another milestone in high-fidelity speech generation, with the model reaching industry-leading levels on key metrics such as naturalness, voice similarity, and prosodic expressiveness.
Recently, the French AI laboratory Kyutai announced that it has officially open-sourced its new text-to-speech model, Kyutai TTS, giving developers and researchers worldwide a high-performance, low-latency speech synthesis solution. The release not only advances open-source AI technology but also opens up new possibilities for multilingual voice interaction applications. AIbase provides an exclusive analysis of this technological highlight and its potential impact. With its ultra-low latency, Kyutai TTS offers a new real-time interaction experience and has become an industry standout for its performance.
On July 3rd, the French AI research institution Kyutai Labs announced the open-sourcing of its latest text-to-speech (TTS) technology, Kyutai TTS, offering developers and AI enthusiasts an efficient, real-time speech generation solution. Kyutai TTS combines low latency with high-fidelity audio and supports streaming text input, so audio generation can begin before the full text is available, making it especially well suited to real-time interaction scenarios. The model performs well in practice, running on a single NVIDIA L40S GPU.
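To make the streaming idea concrete, here is a minimal conceptual sketch of the pattern described above: text is pushed to the synthesizer piece by piece, and audio chunks are consumed as soon as they are produced rather than after the full text is known. The `StreamingTTSClient` class and its methods are hypothetical placeholders for illustration only, not the actual Kyutai TTS API; consult the official repository for the real interface.

```python
# Conceptual sketch of streaming text-to-speech consumption.
# StreamingTTSClient is a hypothetical stand-in, NOT the Kyutai TTS API.

import queue
import threading


class StreamingTTSClient:
    """Hypothetical streaming TTS engine used to illustrate the data flow."""

    def __init__(self):
        self._audio_chunks = queue.Queue()

    def feed_text(self, text_piece: str) -> None:
        # A real engine would start synthesizing immediately; here we emit a
        # dummy audio chunk per text piece to show that audio can be produced
        # before the remaining text arrives.
        self._audio_chunks.put(b"\x00" * 3200)  # ~100 ms of 16 kHz 16-bit silence

    def close(self) -> None:
        self._audio_chunks.put(None)  # sentinel: no more audio will follow

    def audio_stream(self):
        # Yield audio chunks as they become available.
        while True:
            chunk = self._audio_chunks.get()
            if chunk is None:
                break
            yield chunk


def speak_incrementally(text_pieces):
    client = StreamingTTSClient()

    def producer():
        for piece in text_pieces:
            client.feed_text(piece)  # push text as soon as it is available
        client.close()

    threading.Thread(target=producer, daemon=True).start()

    # Audio chunks can be played (or written to a file) while later text
    # is still being generated or typed.
    for chunk in client.audio_stream():
        print(f"received {len(chunk)} bytes of audio")


if __name__ == "__main__":
    speak_incrementally(["Hello, ", "this text arrives ", "one piece at a time."])
```

The point of the sketch is the decoupling: the producer thread feeds text fragments while the consumer loop plays audio as it arrives, which is what keeps end-to-end latency low in real-time interaction.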
Sesame's newly released Conversational Speech Model (CSM) has recently sparked heated discussion on X, where it has been lauded as a voice model that sounds "just like a real person." Its naturalness and emotional expressiveness make it hard for users to distinguish from human speech, and Sesame claims the model has overcome the uncanny valley effect in voice technology. As demonstration videos and user feedback spread, CSM is rapidly emerging as a leader in AI voice technology.