Recently, the Fun-CineForge project, developed by the speech team of Alibaba Tongyi Lab in collaboration with the University of Science and Technology of China, was officially open-sourced. The project tackles core challenges in film and television dubbing, such as lip synchronization, voice style transfer, and emotional expression, and introduces an end-to-end production workflow backed by a large-model solution.

Core Breakthroughs: Solving the "Out of Sync" Pain Points in Film Dubbing
Traditional AI dubbing often suffers from mismatched lip movements, mechanical-sounding emotion, and difficulty adapting to complex film scenes (such as multi-speaker dialogue and reverberant audio). Fun-CineForge achieves a significant breakthrough through two core innovations:
MLLM Dubbing Model: Rather than relying solely on learning audio-video alignment in the lip region, it adopts a multimodal large language model (MLLM) architecture that can deeply understand a character's identity and emotional shifts within the scene.
CineDub Large-Scale Dataset: The team built the first richly annotated Chinese TV drama dubbing dataset via an automated pipeline, covering diverse scenarios such as monologue, narration, dialogue, and multi-speaker scenes.
Project Updates and Open Source Plan
The project has been updated frequently in recent months, reflecting a high degree of engineering maturity:
January to March 2026: Released sample datasets and demos for both Chinese (CineDub-CN) and English (CineDub-EN).
March 16, 2026: Officially released the inference code and model weights (checkpoints); developers can now obtain these resources via GitHub.
Dataset Access: Datasets for several classic series, including the Chinese "Dream of the Red Chamber" and the English "Downton Abbey," are currently available for research use.
Technical Practice: From "Dialogue" to "Performance"
According to the official demo, the model performs impressively when re-dubbing classic series such as "Romance of the Three Kingdoms." By providing specific emotional cues ("Clue" inputs), the model can accurately capture a character's emotional shift from fear to defiance, achieving high-fidelity voice cloning and natural lip synchronization.
The emergence of Fun-CineForge marks a shift in AI film and television dubbing from simple "text-to-speech" toward "automated post-production" with artistic understanding, and it is expected to significantly reduce the cost of producing dubbed films and TV shows.
Project: https://funcineforge.github.io/
