Recently, the Fun-CineForge project, developed by the speech team of Alibaba Tongyi Lab in collaboration with the University of Science and Technology of China, was officially open-sourced. The project tackles core challenges in film and television dubbing, such as lip synchronization, voice style transfer, and emotional expression, and introduces an end-to-end production workflow backed by a large-model solution.

Core Breakthroughs: Solving the "Out of Sync" Pain Points in Film Dubbing
Traditional AI dubbing often suffers from mismatched lip movements, mechanical-sounding emotion, and difficulty adapting to complex film scenes (such as multi-speaker dialogue and reverberant audio). Fun-CineForge achieves a significant breakthrough through two core innovations:
MLLM Dubbing Model: Rather than relying solely on learning audio-video alignment in the lip region, it adopts a multimodal large language model (MLLM) architecture that can deeply understand a character's identity and emotional shifts within the scene.
CineDub Large-Scale Dataset: The team built the first richly annotated Chinese TV drama dubbing dataset via an automated pipeline, covering diverse scenarios such as monologue, narration, dialogue, and multi-speaker scenes.
Project Updates and Open Source Plan
The project has been updated frequently in recent months, reflecting a high degree of engineering maturity:
January to March 2026: Released sample datasets and demos for both Chinese (CineDub-CN) and English (CineDub-EN).
March 16, 2026: Officially released the inference code and model weights (checkpoints); developers can now obtain these resources via GitHub.
Dataset Access: Datasets for several classic series, including the Chinese "Dream of the Red Chamber" and the English "Downton Abbey," are currently available for research use.
Technical Practice: From "Dialogue" to "Performance"
According to the official demo, the model performs impressively when re-dubbing classic series such as "Romance of the Three Kingdoms." By providing specific emotional cues ("Clue" inputs), the model can accurately capture a character's emotional shift from fear to defiance, achieving high-fidelity voice cloning and natural lip synchronization.
The emergence of Fun-CineForge marks a shift in AI film and television dubbing from simple "text-to-speech" toward "automated post-production" with artistic understanding, and it is expected to significantly reduce the cost of producing dubbed films and TV shows.
Project: https://funcineforge.github.io/
