The Meituan LongCat large model team has officially open-sourced LongCat-Video-Avatar 1.5, a commercial-grade digital human video generation model. This version marks a shift from open-source SOTA (state-of-the-art) results to practical commercial use, with significant improvements in lip synchronization, physical plausibility, long-video stability, multi-person interaction, and inference efficiency.
Three Major Enhancements: Tackling Commercialization Pain Points
To make digital humans truly suitable for diverse real-world scenarios, LongCat-Video-Avatar 1.5 addresses long-standing issues such as "jitter, distortion, and high latency" in traditional digital human videos with three comprehensive upgrades:
Commercialization of Basic Experience (Audio Encoder Upgrade)
The audio feature extraction encoder has been upgraded from Wav2Vec2 to Whisper-large. With a larger parameter count and richer multilingual priors, the new encoder captures phoneme changes and pronunciation rhythm in finer detail. This improves lip movements for complex audio such as long sentences, fast speech, and singing, and it also lets facial, head, and body movements coordinate naturally with speech, significantly reducing the frame skipping and identity drift commonly seen in long videos.
Strong Open-Domain Generalization (Multi-Stage Enhanced Data System)
To handle a wide range of subjects reliably, including real people, virtual idols, anime characters, and animals, the team built a multi-stage data processing workflow that combines "offline annotation" with "online validation," and added three types of targeted enhancement data:
Multi-Person Data: Utilizing active speaker detection, it eliminates audio-visual ambiguity in multi-person scenarios, accurately distinguishing speakers from listeners.
Quiet Data: Selecting videos without speech allows the model to learn natural micro-expressions during non-speaking states, preventing mouth movement in non-speaking characters.
Emotion Data: Combined with frame-level emotion recognition filtering, it injects emotional changes, enabling the model to understand the deep connection between speech and expressions.
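The three data categories above can be pictured as a routing step applied to annotated clips. The following is a minimal sketch under stated assumptions, not the team's pipeline: the `Clip` fields, the thresholds, and the `route_clip` helper are all hypothetical names invented for illustration, and the annotations are assumed to come from an earlier offline labeling pass.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Clip:
    """Hypothetical per-clip annotations (field names are assumptions)."""
    clip_id: str
    num_faces: int
    speech_ratio: float          # fraction of frames containing speech, 0..1
    speaker_scores: List[float]  # one active-speaker score per face, 0..1
    emotion_labels: List[str]    # one emotion label per frame


def route_clip(clip: Clip, quiet_max: float = 0.05,
               speaker_min: float = 0.8) -> Set[str]:
    """Return the enhancement categories a clip could contribute to."""
    tags = set()
    if clip.speech_ratio <= quiet_max:
        # Almost no speech: usable as "quiet" data for non-speaking states.
        tags.add("quiet")
    elif clip.num_faces > 1:
        # Keep multi-person clips only when exactly one face is confidently
        # the active speaker, so speaker and listeners are unambiguous.
        confident = [s for s in clip.speaker_scores if s >= speaker_min]
        if len(confident) == 1:
            tags.add("multi_person")
    if len(set(clip.emotion_labels)) > 1:
        # Frame-level labels change within the clip: emotional variation.
        tags.add("emotion")
    return tags


if __name__ == "__main__":
    demo = Clip("demo", 2, 0.6, [0.95, 0.10], ["neutral", "happy"])
    print(sorted(route_clip(demo)))
```

A clip may receive several tags or none; clips with ambiguous speakers are simply dropped from the multi-person set in this sketch.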
Special Alignment for Hands and Continuity (Introducing GRPO)
For scenarios requiring frequent hand exposure, such as e-commerce live streams and product demonstrations, the model introduces GRPO (Group Relative Policy Optimization) for human preference alignment, refining the reward signal down to the frame level and adding a first-frame hand detection mechanism. This significantly alleviates industry challenges such as hand distortion, local structural collapse, and inconsistent actions.
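GRPO-style methods typically score each generated sample relative to the other samples drawn for the same input, normalizing rewards by the group mean and standard deviation. The sketch below shows that normalization and one possible frame-level variant. How the model actually defines and aggregates its frame-level rewards is not described here, so `frame_level_advantages` and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def frame_level_advantages(frame_rewards: List[List[float]],
                           eps: float = 1e-6) -> List[List[float]]:
    """Apply the group normalization independently at each frame index.

    frame_rewards[i][t] is the (hypothetical) reward of sample i at frame t;
    all samples are assumed to have the same number of frames.
    """
    num_samples = len(frame_rewards)
    num_frames = len(frame_rewards[0])
    per_frame = [
        group_relative_advantages([s[t] for s in frame_rewards], eps)
        for t in range(num_frames)
    ]
    # Transpose back to [sample][frame] layout.
    return [[per_frame[t][i] for t in range(num_frames)]
            for i in range(num_samples)]


if __name__ == "__main__":
    # Two samples, two frames: only frame 0 distinguishes the samples.
    print(frame_level_advantages([[1.0, 0.0], [0.0, 0.0]]))
```

With per-frame normalization, a sample is only rewarded or penalized on the frames where it differs from its group, which is one way a localized defect such as a distorted hand could be targeted.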

15x Faster Inference: Say Goodbye to Expensive Compute Power
Another key aspect of commercial applications is cost. LongCat-Video-Avatar 1.5 adopts DMD (Distribution Matching Distillation), compressing the original 50-step generation process into 8 steps. Meanwhile, the team replaced the traditional scheme of running three models in parallel with an architecture of one shared base model plus multiple LoRA adapters, significantly freeing up VRAM.
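To see why a shared base with LoRA adapters saves memory, it helps to count parameters. A LoRA adapter attaches two low-rank matrices to a frozen weight of shape d_out x d_in, adding rank * (d_in + d_out) parameters. The sketch below compares three full model copies against one base plus three adapters. Every number in it (base size, hidden width, layer count, rank, adapted matrices per layer) is a hypothetical placeholder, not a figure from the release.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_out x d_in weight."""
    return rank * (d_in + d_out)


def total_params(base_params: int, num_tasks: int,
                 adapter_params: int, shared_base: bool) -> int:
    """Total parameters held in memory for num_tasks specialized variants."""
    if shared_base:
        return base_params + num_tasks * adapter_params
    return num_tasks * base_params


# Hypothetical configuration, for illustration only.
BASE_PARAMS = 10_000_000_000   # assumed base model size
HIDDEN = 4096                  # assumed hidden width
LAYERS = 48                    # assumed number of transformer layers
ADAPTED_PER_LAYER = 4          # assumed adapted projection matrices per layer
RANK = 128                     # assumed LoRA rank
BYTES_PER_PARAM = 2            # bf16 / fp16 weights

adapter = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
separate = total_params(BASE_PARAMS, 3, adapter, shared_base=False)
shared = total_params(BASE_PARAMS, 3, adapter, shared_base=True)

if __name__ == "__main__":
    gib = 1024 ** 3
    print(f"three full models : {separate * BYTES_PER_PARAM / gib:.1f} GiB")
    print(f"base + 3 adapters : {shared * BYTES_PER_PARAM / gib:.1f} GiB")
```

Under these assumed sizes the adapters add only a few percent on top of a single base model, so weight memory is close to one third of the three-copy setup. Activation memory is not counted here.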
In actual testing, inference was approximately 15 times faster, and generating a 10-second video takes about 1 minute.
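The reported figures can be related with some quick arithmetic. The sketch below uses only the numbers stated above (50 steps, 8 steps, roughly 15x, a 10-second video in about 1 minute). The breakdown of the speedup is not given, so the "residual" factor is simply what remains after the step reduction, and the implied baseline time assumes the 15x figure applies to the same 10-second clip.

```python
baseline_steps = 50        # original sampling steps
distilled_steps = 8        # steps after distillation
reported_speedup = 15.0    # approximate end-to-end speedup reported
video_seconds = 10         # length of the generated clip
generation_seconds = 60    # approximate generation time reported

# Speedup attributable to running fewer sampling steps, all else equal.
step_speedup = baseline_steps / distilled_steps

# Remaining factor that would have to come from other optimizations.
residual_speedup = reported_speedup / step_speedup

# Seconds of compute needed per second of output video.
realtime_factor = generation_seconds / video_seconds

# Implied pre-optimization time for the same clip (an assumption, see above).
implied_baseline_seconds = generation_seconds * reported_speedup

if __name__ == "__main__":
    print(f"step reduction alone : {step_speedup:.2f}x")
    print(f"residual factor      : {residual_speedup:.2f}x")
    print(f"real-time factor     : {realtime_factor:.1f}x slower than playback")
    print(f"implied baseline     : {implied_baseline_seconds / 60:.0f} min")
```

Step reduction alone accounts for 6.25x, so the rest of the reported gain would have to come from other changes that are not itemized here.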
Authoritative Benchmark Evaluation: Leading Industry Top Models
Based on the EvalTalker benchmark, 770 evaluators and 10 domain experts conducted structured quality analysis of videos covering complex scenarios such as news, education, and entertainment. Data shows that LongCat-Video-Avatar 1.5 performed impressively in several key metrics:
User Preference Win Rate: Its win rate is 65.9% against Kling Avatar2.0, 61.1% against OmniHuman-1.5, and 54.3% against HeyGen.
Single/Multi-Person Scenario Scores: The single-person scenario score is 3.336, significantly higher than products like HeyGen; the multi-person scenario score is 2.730, greatly surpassing InfiniteTalk (2.339).
Video Stability: The subject deformation rate is only 23.1%, and the background deformation rate is just 9.4%; the frame-skipping issue rate is as low as 0.8%, the best among all compared models.
Audio-Visual Coordination: The face-body synchronization issue rate dropped to 5.1%, and the lip synchronization issue rate dropped to 29.8%, both better than those of traditional commercial systems.
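For readers who want to reproduce a pairwise preference metric on their own samples, a win rate can be computed from side-by-side votes as sketched below. The exact protocol behind the figures above is not described here. The vote labels, the `win_rate` helper, and both tie-handling options are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable


def win_rate(votes: Iterable[str], ours: str = "ours",
             ties_as_half: bool = True) -> float:
    """Fraction of pairwise comparisons won by `ours`.

    Each vote is `ours`, the name of the other system, or "tie".
    With ties_as_half=True a tie counts as half a win; otherwise
    ties are excluded from the denominator.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes")
    wins, ties = counts[ours], counts["tie"]
    if ties_as_half:
        return (wins + 0.5 * ties) / total
    decided = total - ties
    if decided == 0:
        raise ValueError("all votes are ties")
    return wins / decided


if __name__ == "__main__":
    sample = ["ours"] * 6 + ["other"] * 3 + ["tie"]
    print(win_rate(sample), win_rate(sample, ties_as_half=False))
```

A win rate above 50% under either convention means evaluators preferred the model more often than the system it was compared against.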
The Meituan LongCat large model team stated that the open-source release of LongCat-Video-Avatar 1.5 is not only a version update but also an invitation to the global developer and creator community. The team hopes the model will become a verifiable and improvable technical foundation on which the community can jointly expand the practical application boundaries of digital human videos.
Open Source Links:
Github: https://github.com/meituan-longcat/LongCat-Video
HuggingFace: https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5
Tech Report: https://github.com/meituan-longcat/LongCat-Video/blob/main/assets/LongCat-Video-Avatar-1.5-Tech-Report.pdf
Project Page: https://meigen-ai.github.io/LongCat-Video-Avatar-1.5-Page/
Modelscope: https://www.modelscope.cn/models/meituan-longcat/LongCat-Video-Avatar-1.5/summary
