Researchers recently unveiled LPM1.0, a research project that generates videos of a person speaking, listening, or singing in real time from a single reference image. The core breakthrough of LPM1.0 is its multimodal processing: it synchronizes and integrates text, audio, and image inputs to produce dynamic scenes with precise lip synchronization, nuanced facial expressions, and natural emotional transitions. The model can be connected directly to mainstream voice AI platforms such as ChatGPT and Doubao, turning traditional voice conversations into real-time interactive experiences with visual feedback.
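The announcement does not describe the integration interface, but the kind of real-time loop it implies can be sketched roughly: streamed audio from a voice assistant is paired with one fixed reference image and turned into short video segments as it arrives. Everything below (function names, data shapes, the placeholder generator) is illustrative and not the project's actual API.

```python
# Hypothetical sketch of a real-time driving loop: audio chunks streamed from a
# voice assistant are paired with a single reference image and rendered into
# short video segments. All names here are illustrative placeholders.

import time
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class VideoSegment:
    frames: List[bytes]   # encoded frames produced for one audio chunk
    timestamp: float      # start time, used to keep audio and lips aligned


def streamed_assistant_audio() -> Iterator[bytes]:
    """Stand-in for streamed TTS audio from a platform such as ChatGPT or Doubao."""
    for i in range(3):
        yield f"audio-chunk-{i}".encode()


def generate_segment(reference_image: bytes, audio_chunk: bytes) -> VideoSegment:
    """Placeholder for the portrait model: one audio chunk in, a few frames out."""
    frames = [audio_chunk + f"-frame-{k}".encode() for k in range(4)]
    return VideoSegment(frames=frames, timestamp=time.time())


def realtime_loop(reference_image: bytes) -> None:
    # Each incoming chunk is rendered immediately so the avatar keeps pace
    # with the ongoing conversation instead of waiting for the full reply.
    for chunk in streamed_assistant_audio():
        segment = generate_segment(reference_image, chunk)
        print(f"segment at {segment.timestamp:.2f} with {len(segment.frames)} frames")


if __name__ == "__main__":
    realtime_loop(reference_image=b"portrait.png")
```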
On the technical level, LPM1.0 introduces "multi-granularity identity conditioning," which extracts detail from reference material covering multiple angles and expressions, so the model does not have to invent complex features such as teeth, wrinkles, or side profiles on its own. This significantly improves cross-style capability: photorealistic faces, animated characters, and 3D game characters can all be driven immediately, without additional training. The model also supports streaming generation, remaining stable for videos up to 45 minutes long.
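The announcement does not detail how the multi-granularity conditioning works internally. A minimal sketch of the general idea, assuming separate identity encoders per reference crop whose outputs are stacked into one conditioning tensor, might look like the following; all function names, crop categories, and shapes are assumptions, not the released implementation.

```python
# Illustrative sketch of "multi-granularity identity conditioning": identity
# features are read off several reference crops (frontal, profile, mouth
# region) and handed to the generator as conditioning, so details like teeth
# or side contours don't have to be hallucinated. Names and shapes are assumed.

from typing import Dict

import numpy as np


def encode_crop(crop: np.ndarray, dim: int = 64) -> np.ndarray:
    """Placeholder identity encoder: any image crop -> fixed-size feature vector."""
    rng = np.random.default_rng(abs(hash(crop.tobytes())) % (2**32))
    return rng.standard_normal(dim).astype(np.float32)


def build_identity_condition(reference_crops: Dict[str, np.ndarray]) -> np.ndarray:
    """Stack per-crop features, coarse to fine, into one conditioning tensor."""
    order = ["frontal", "profile", "mouth"]  # coarse -> fine granularity
    feats = [encode_crop(reference_crops[name]) for name in order if name in reference_crops]
    return np.stack(feats, axis=0)           # shape: (num_granularities, dim)


if __name__ == "__main__":
    crops = {
        "frontal": np.zeros((256, 256, 3), dtype=np.uint8),
        "profile": np.ones((256, 256, 3), dtype=np.uint8),
        "mouth":   np.full((64, 128, 3), 128, dtype=np.uint8),
    }
    cond = build_identity_condition(crops)
    print(cond.shape)  # (3, 64): passed to the generator alongside audio features
```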
In terms of interaction logic, LPM1.0 accurately distinguishes three dialogue states: when listening, it generates reactive expressions such as nodding or shifting gaze; when speaking, it drives lip and body movements from the audio; when idle, it produces natural, relaxed behaviors according to text instructions. Project manager Zeng Ailing pointed out that LPM1.0 is suited not only to real-time conversation but also to offline audio-driven video generation, offering an additional production path for podcasts and film and television work.
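The three-state logic described above can be illustrated with a toy dispatcher; the state names follow the article, but the mapping from each state to concrete motion is an assumption rather than the model's documented behavior.

```python
# Toy sketch of the three dialogue states: listening, speaking, idle.
# How each state maps to motion below is an assumption for illustration only.

from enum import Enum, auto
from typing import Optional


class DialogueState(Enum):
    LISTENING = auto()
    SPEAKING = auto()
    IDLE = auto()


def plan_motion(state: DialogueState,
                audio_chunk: Optional[bytes] = None,
                instruction: Optional[str] = None) -> str:
    """Return a (toy) motion plan for the current dialogue state."""
    if state is DialogueState.LISTENING:
        return "reactive cues: nodding, gaze shifting toward the speaker"
    if state is DialogueState.SPEAKING:
        assert audio_chunk is not None, "speaking is driven by the incoming audio"
        return f"lip and body motion driven by {len(audio_chunk)} bytes of audio"
    return f"idle behavior from text instruction: {instruction or 'neutral posture'}"


if __name__ == "__main__":
    print(plan_motion(DialogueState.LISTENING))
    print(plan_motion(DialogueState.SPEAKING, audio_chunk=b"\x00" * 3200))
    print(plan_motion(DialogueState.IDLE, instruction="glance around, sip coffee"))
```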
Despite its strong application potential, the development team emphasized that LPM1.0 remains a research project, with no plans to release code or weights at this stage. The researchers acknowledged that a qualitative gap still exists between the generated videos and real footage, and that the deepfake risks inherent in the technology cannot be ignored. The significance of the work lies in clarifying where AI systems are headed: from single-mode, purely logical interaction toward a multidimensional form of interaction featuring emotional response, eye contact, and visual embodiment.
