At the 2025 World Artificial Intelligence Conference (WAIC 2025), Shengshu Technology officially launched the "Reference Video" feature for Vidu Q1, reworking the traditional video production process through algorithmic innovation and marking a breakthrough for the field of video generation.
Say Goodbye to Storyboarding, Create Videos in One Click
The biggest highlight of "Reference Video" is that it skips the complex pre-production storyboarding stage. Users simply upload reference images of characters, props, and scenes together with a text prompt, and the system generates complete video content directly. The production pipeline shortens from the traditional "storyboard → video generation → editing → final video" to "reference images → video generation → editing → final video".
For example, given the prompt "Zhuge Liang discussing with Churchill and Napoleon in a meeting room" plus reference images of the three historical figures and a meeting-room scene, the system generates a complete video showing the three in conversation.
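To make the workflow concrete, here is a minimal sketch of what such a reference-to-video request might look like over HTTP. The endpoint, parameter names, and response fields below are illustrative assumptions, not Vidu's published API.

```python
# Hypothetical sketch of a reference-to-video request. The endpoint,
# parameter names, and response fields are illustrative assumptions,
# not Vidu's actual API.
import requests

API_URL = "https://example.com/v1/reference2video"  # placeholder endpoint

payload = {
    "model": "vidu-q1",
    "prompt": "Zhuge Liang discussing with Churchill and Napoleon in a meeting room",
    # Reference Video reportedly keeps up to seven subjects consistent.
    "reference_images": [
        "zhuge_liang.png",
        "churchill.png",
        "napoleon.png",
        "meeting_room.png",
    ],
}

response = requests.post(API_URL, json=payload, timeout=600)
response.raise_for_status()
print(response.json()["video_url"])  # assumed response field
```

The point of the sketch is the shape of the input: subject and scene references plus one prompt replace an entire storyboard.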

Cracking the Core Challenges of Commercialization
The core advantage of this feature lies in solving a key bottleneck in commercializing video models: subject consistency. Vidu Q1's Reference Video currently supports up to seven subjects as simultaneous inputs while keeping each of them consistent, which, according to Shengshu Technology, is sufficient for most creative scenarios.
Lu Yihang, CEO of Shengshu Technology, said that this general-purpose creation method will better serve diverse commercial scenarios such as advertising, animation, film and television, cultural tourism, and education, enabling a fundamental shift from offline shooting to online AI creation.
Technical Path and Industrial Orientation
Shengshu Technology builds on the U-ViT architecture, which combines diffusion models with a Transformer backbone, and optimizes its algorithm modules on that foundation. The Vidu model has built-in multimodal understanding capabilities that have been applied successfully to video generation.
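As a rough illustration of the U-ViT idea (a sketch of the published architecture, not Shengshu's production code), the snippet below shows its two defining traits: the diffusion timestep, the condition, and the noisy-image patches are all treated as tokens of a single Transformer, and long skip connections link shallow and deep blocks, U-Net style.

```python
# Simplified U-ViT-style denoiser (illustrative sketch only).
# Time, condition, and noisy-image patches are all tokens of one
# Transformer; long skips connect shallow and deep blocks, U-Net style.
import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    def __init__(self, dim=512, depth=8, n_patches=256, heads=8):
        super().__init__()
        assert depth % 2 == 0
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)    # flattened 16x16 RGB patches
        self.time_embed = nn.Linear(1, dim)               # diffusion timestep as a token
        self.cond_embed = nn.Linear(dim, dim)             # text/reference condition as a token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 2, dim))
        make_block = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.in_blocks = nn.ModuleList([make_block() for _ in range(depth // 2)])
        self.out_blocks = nn.ModuleList([make_block() for _ in range(depth // 2)])
        # Long skips: concatenate shallow and deep features, project back to dim.
        self.skips = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(depth // 2)])
        self.head = nn.Linear(dim, 16 * 16 * 3)           # predict noise per patch

    def forward(self, patches, t, cond):
        x = self.patch_embed(patches)                     # (B, N, dim)
        tok_t = self.time_embed(t[:, None, None].float()) # (B, 1, dim)
        tok_c = self.cond_embed(cond)[:, None, :]         # (B, 1, dim)
        x = torch.cat([tok_t, tok_c, x], dim=1) + self.pos
        stack = []
        for blk in self.in_blocks:                        # shallow half
            x = blk(x)
            stack.append(x)
        for blk, proj in zip(self.out_blocks, self.skips):  # deep half with long skips
            x = proj(torch.cat([x, stack.pop()], dim=-1))
            x = blk(x)
        return self.head(x[:, 2:])                        # drop time/cond tokens
```

The long skip connections are what put the "U" in U-ViT; the rest is a standard ViT denoiser that a diffusion sampler would call at every step.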
Lu Yihang emphasized that the team prioritizes industrial application and has not made the integration of understanding and generation its top priority, saying, "Industry clients care more about content quality than technical approaches."
Expanding the Field of Embodied Intelligence
On July 25, Tsinghua University and Shengshu Technology jointly released Vidar, an embodied-intelligence model that achieves low-cost, few-shot generalization through a "video large model + embodied intelligence" approach.
Lu Yihang explained that video models and embodied intelligence both fundamentally process spatiotemporal information and share the same input-and-decision logic. Building on the Vidu video large model, the team trains with a small number of robot-operation videos and can then convert generated virtual videos into corresponding robotic-arm movements, effectively easing the data-scarcity problem of traditional VLA (vision-language-action) approaches.
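One plausible reading of that pipeline, sketched below with entirely hypothetical names (the article does not publish Vidar's interfaces): a video model fine-tuned on a few robot-operation clips predicts future frames for a task, and an inverse-dynamics model decodes the arm action between each pair of consecutive frames.

```python
# One plausible "video model + inverse dynamics" pipeline. Every name
# here is hypothetical; Vidar's real interfaces are not public.

def vidar_style_control(video_model, inverse_dynamics, instruction, observation):
    """Generate a task video, then decode robot-arm actions frame by frame."""
    # 1. A video model, lightly fine-tuned on a small set of robot-operation
    #    clips, predicts future frames for the instructed task.
    frames = video_model.generate(prompt=instruction, first_frame=observation)
    # 2. An inverse-dynamics model maps each consecutive frame pair to the
    #    arm action that would produce that visual transition.
    return [inverse_dynamics(prev, nxt) for prev, nxt in zip(frames, frames[1:])]
```

Under this reading, the expensive spatiotemporal reasoning lives in the pre-trained video model, so only a small amount of robot data is needed to close the loop to actions.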
For now, Vidu continues to prioritize improving its video generation capabilities while treating embodied intelligence as an ongoing exploration direction that opens up potential commercial markets for the field.
