When AI speech not only sounds human but sounds like *you*, and responds so quickly the delay is almost imperceptible, the boundaries of voice interaction are being redefined. In the early morning of October 30, MiniMax (Xiyu Technology) officially launched its next-generation text-to-speech model, MiniMax Speech 2.6. It pairs real-time performance, with end-to-end latency below 250 milliseconds, with Fluent LoRA voice cloning, pushing voice generation into a new era of high naturalness, low latency, and strong personalization.
Within 250 Milliseconds: Real-Time Response Close to Human Conversation
In voice interaction scenarios, latency is the lifeline of the experience. Speech 2.6 achieves end-to-end latency below 250 milliseconds from text input to audio output through deep architecture optimization, matching the rhythm of natural human conversation. This means that in high-demand scenarios such as smart customer service, real-time subtitles, and virtual anchors, AI speech no longer lags behind, truly achieving smooth dialogue and immersive interaction.
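For streaming TTS, "end-to-end latency" is usually measured as the time from submitting text to receiving the first audio chunk. The sketch below shows one way to measure that time-to-first-audio; `synthesize_stream` is a hypothetical stand-in, not the actual MiniMax API, whose client methods are not described in this article.

```python
import time

def synthesize_stream(text):
    """Hypothetical streaming TTS client: yields audio chunks as they
    are generated. A real client would call a network API instead."""
    time.sleep(0.2)           # placeholder for model inference time
    yield b"\x00" * 3200      # placeholder: ~100 ms of 16 kHz 16-bit audio
    time.sleep(0.05)
    yield b"\x00" * 3200

def first_chunk_latency_ms(text):
    """Time from request to first audio chunk, in milliseconds."""
    start = time.perf_counter()
    stream = synthesize_stream(text)
    next(stream)              # block until the first audio chunk arrives
    return (time.perf_counter() - start) * 1000.0

print(f"time to first audio: {first_chunk_latency_ms('Hello'):.0f} ms")
```

Measuring time-to-first-audio rather than time-to-full-clip is what matters for conversational use: playback can begin while the rest of the utterance is still being synthesized.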
Fluent LoRA: Clone Your Unique Voice with 30 Seconds of Audio
The headline breakthrough is the deep integration of Fluent LoRA (Low-Rank Adaptation). With as little as 30 seconds of reference audio, the model can capture a speaker's timbre, tone, rhythm, and even emotional style, and generate natural speech that closely matches the target text. Whether cloning your own voice to tell a bedtime story or building a custom virtual brand ambassador, voice cloning has never been this simple, efficient, and realistic.
More importantly, Fluent LoRA significantly improves the fluency of speech while ensuring consistent voice quality, avoiding common issues in traditional TTS such as "mechanical sentence breaks" or "emotional misalignment," making synthesized speech truly expressive.
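MiniMax has not published Fluent LoRA's internals, but the generic LoRA idea it builds on is well documented: instead of fine-tuning a full weight matrix, you train a small low-rank update on top of frozen base weights. A minimal NumPy sketch, with illustrative shapes and scaling only:

```python
import numpy as np

# Generic LoRA: learn a low-rank update B @ A on top of a frozen W,
# so a voice adapter adds few parameters and is cheap to train/swap.
d_in, d_out, rank = 512, 512, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.02    # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init

def lora_forward(x, alpha=16.0):
    """Base path plus low-rank adapter path, scaled by alpha / rank."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)
# Adapter size rank*(d_in+d_out) vs full fine-tune size d_in*d_out:
print(A.size + B.size, "adapter params vs", W.size, "full params")
```

The zero-initialized `B` means a freshly created voice adapter leaves the base model's output untouched, which is one reason LoRA-style cloning can preserve the base model's fluency while layering a target speaker's characteristics on top.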
Full Scene Coverage: From Personal Creation to Enterprise Deployment
MiniMax Speech 2.6 is now available for both individual creators and enterprise customers:
- Educational field: Teachers can quickly generate lecture audio for courseware;
- Customer service: Enterprises can deploy intelligent voice robots with brand-specific voices;
- Smart hardware: In-vehicle and home devices can achieve low-latency, high-fidelity voice interaction;
- Content production: video creators (Bilibili "UP主") and podcasters can instantly generate multi-character voiceovers, greatly improving production efficiency.
As a key component of MiniMax's multimodal large-model ecosystem, Speech 2.6 not only deepens the company's technical footing in AIGC, but also signals that text-to-speech is moving from "functionally usable" into a new era of "emotionally credible and personally customizable."
In today's AI landscape, where competition increasingly turns on the details of the experience, MiniMax makes the case that true intelligence is not just about computing fast: with latency under 250 milliseconds and the ability to "speak like you," it is also about speaking like a human, and speaking well.
