Traditional AI voice dubbing often hits a bottleneck in demanding scenarios such as film and animation, where it struggles to match intense emotional delivery and precise lip movements. To address this pain point, Tongyi Lab has officially released and open-sourced Fun-CineForge, the first film-grade, multi-scenario multimodal large model for voice dubbing.

Breaking "Audio-Visual Disconnection": Four Strict Dimensions of Collaboration

Unlike traditional models that rely solely on text-to-speech, Fun-CineForge aims to overcome four core challenges in film production:

  • Lip Sync: Achieve a high level of consistency between synthesized speech and the mouth movements in the video.

  • Emotional Expression: Combine facial cues and instruction text to give the voice human-like emotional depth.

  • Voice Consistency: Maintain a stable voice for specific characters in complex multi-character dialogues.

  • Time Alignment: Insert speech at millisecond-precise time points even when the speaker is obscured or off-frame (a small worked example follows this list).
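
To make "millisecond-precise" insertion concrete, here is a generic sketch (not Fun-CineForge code) of what placing a synthesized line at an exact timestamp involves: converting the timestamp to a sample offset in the target track. The 24 kHz sample rate is an assumption for illustration only.

```python
# Generic audio math, not Fun-CineForge code: placing a synthesized line at an
# exact timestamp means converting that timestamp to a sample offset.

import numpy as np

SAMPLE_RATE = 24_000  # assumed output rate, for illustration only

def insert_at(track: np.ndarray, segment: np.ndarray, start_sec: float) -> np.ndarray:
    """Mix `segment` into `track` starting at `start_sec` (e.g. 3.250 s)."""
    start = round(start_sec * SAMPLE_RATE)      # 3.250 s -> sample 78,000
    out = track.copy()
    out[start:start + len(segment)] += segment  # assumes the segment fits in the track
    return out

track = np.zeros(SAMPLE_RATE * 10, dtype=np.float32)                # 10 s of silence
line = (np.random.randn(SAMPLE_RATE * 2) * 0.1).astype(np.float32)  # a 2 s synthesized clip
mixed = insert_at(track, line, start_sec=3.250)
```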

Core Technology: Introducing a "Time Modality" and a High-Quality Dataset

Fun-CineForge's technical breakthrough lies in its integrated "data + model" design:

  1. CineDub High-Quality Dataset: Tongyi Lab has also open-sourced the CineDub automated dataset construction pipeline. The pipeline uses a chain-of-thought error-correction mechanism, reducing the transcription error rate for Chinese and English text to around 1%-2% and lowering the speaker separation error rate to 1.2%.

  2. Four-Modality Fusion Architecture: The model introduces a "time modality" for the first time, jointly modeling it with visual (lip shape and expression), text (dialogue and emotion), and audio (voice reference) inputs. This lets the model stay precisely synchronized even in complex scenes where no face is visible; a toy sketch of this kind of fusion follows below.
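
The following is a minimal, hypothetical sketch of how such a four-stream fusion could look. The dimensions, module names, and layer counts are assumptions for illustration and are not taken from the released architecture.

```python
# A toy four-modality fusion block (assumed design, not the released model):
# time, visual, text, and audio features are projected into a shared space,
# tagged by modality, concatenated, and modeled jointly by a transformer.

import torch
import torch.nn as nn

class FourModalityFusion(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.proj_time = nn.Linear(2, d_model)        # (start, end) per dialogue segment
        self.proj_visual = nn.Linear(768, d_model)    # lip/expression features per frame
        self.proj_text = nn.Linear(1024, d_model)     # dialogue/emotion text embeddings
        self.proj_audio = nn.Linear(256, d_model)     # reference-voice embeddings
        self.modality_tag = nn.Embedding(4, d_model)  # lets the encoder tell streams apart
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, time_feats, visual_feats, text_feats, audio_feats):
        streams = [
            self.proj_time(time_feats) + self.modality_tag.weight[0],
            self.proj_visual(visual_feats) + self.modality_tag.weight[1],
            self.proj_text(text_feats) + self.modality_tag.weight[2],
            self.proj_audio(audio_feats) + self.modality_tag.weight[3],
        ]
        # Even if the visual stream is empty (face off-screen), the time tokens
        # still anchor where each line of speech must start and stop.
        return self.encoder(torch.cat(streams, dim=1))

model = FourModalityFusion()
out = model(
    torch.rand(1, 3, 2),      # 3 dialogue segments with (start, end) times
    torch.rand(1, 75, 768),   # ~3 s of lip/expression frames at 25 fps
    torch.rand(1, 20, 1024),  # tokenized dialogue with emotion cues
    torch.rand(1, 10, 256),   # reference-voice embedding frames
)
print(out.shape)  # torch.Size([1, 108, 512])
```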

Outstanding Performance: Filling the Gap in Multi-Person Dialogue Dubbing

Experimental results show that Fun-CineForge significantly outperforms baseline models such as DeepDubber-V1 on word and character error rate (WER/CER), lip sync (LSE-C/D), and voice similarity. Notably, it is the first to offer precise support for two-person and multi-person dialogue scenes, and it remains robust on video clips of up to 30 seconds.
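
For reference, WER measures the fraction of word substitutions, insertions, and deletions needed to turn the synthesized transcript into the reference (CER is the same computation over characters). The snippet below is a generic implementation of the metric, not the evaluation script used in the paper.

```python
# Word error rate: minimum edits (substitution/insertion/deletion) to turn the
# hypothesis into the reference, divided by reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via dynamic programming.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("we open at dawn", "we opened at dawn"))  # 0.25
```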

  • GitHub: https://github.com/FunAudioLLM/FunCineForge

  • HuggingFace: https://huggingface.co/FunAudioLLM/Fun-CineForge

  • ModelScope: https://www.modelscope.cn/models/FunAudioLLM/Fun-CineForge/
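
A quick way to fetch the released weights is from the Hugging Face repo listed above; the inference entry points are documented in the GitHub repository rather than in this post, so only the download step is shown here.

```python
# Download the Fun-CineForge checkpoint from the Hugging Face repo listed above.
# See the GitHub repository's README for how to run dubbing with it.

from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="FunAudioLLM/Fun-CineForge")
print(f"Checkpoint downloaded to: {local_dir}")
```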