Researchers from ByteDance and Nanyang Technological University have jointly developed StoryMem, a system designed to solve the problem of characters looking different from scene to scene in AI-generated videos. StoryMem stores key frames during video generation and references them when producing subsequent scenes, keeping characters and environments consistent.

Current AI video generation models such as Sora, Kling, and Veo perform well at producing short clips, but when multiple scenes are stitched into a coherent story, characters change appearance and environments drift. Previous solutions either demand large amounts of compute or lose consistency across scene boundaries.

StoryMem takes a different approach. As it generates a video, it stores visually salient frames in a memory bank and feeds them back in when generating new scenes. A selection algorithm keeps the memory compact while preserving key visual information, including frames from the very beginning of the story. When a new scene is generated, the stored frames are passed to the model alongside the clip currently being created, anchoring the output to what came before.
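The article does not spell out StoryMem's exact selection criterion, so the following is only a minimal sketch of the general idea: a similarity-gated memory bank that admits a frame only if it looks sufficiently different from what is already stored, and that never evicts the story's opening frame. The class name, parameters, and embedding inputs are all hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class KeyFrameMemory:
    """Keep a small bank of visually distinct frames, indexed by embedding."""

    def __init__(self, capacity=8, sim_threshold=0.9):
        self.capacity = capacity
        self.sim_threshold = sim_threshold
        self.frames = []      # stored frame data (any payload)
        self.embeddings = []  # one embedding per stored frame

    def maybe_add(self, frame, embedding):
        """Store the frame only if it is novel enough; returns True if stored."""
        if any(cosine_sim(embedding, e) > self.sim_threshold
               for e in self.embeddings):
            return False  # too similar to an existing memory frame
        self.frames.append(frame)
        self.embeddings.append(embedding)
        # Evict the oldest frame *after* the first, so the story's
        # opening look is never forgotten.
        if len(self.frames) > self.capacity:
            self.frames.pop(1)
            self.embeddings.pop(1)
        return True

    def conditioning_frames(self):
        """Frames to feed into the model alongside the current clip."""
        return list(self.frames)
```

In a real pipeline the embeddings would come from a vision encoder over generated frames; here they are just vectors so the gating logic is easy to follow.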

For training, the team applied Low-Rank Adaptation (LoRA) to adapt Alibaba's open-source Wan2.2-I2V model. They trained on 400,000 five-second video clips, grouped by visual similarity, so the model learns to generate continuations in a consistent style.

According to the reported results, StoryMem improves cross-scene consistency by 28.7% over the unmodified base model. User studies also found that participants preferred StoryMem's output, rating it more aesthetically pleasing and more consistent.

The team also notes limitations: in complex scenes with multiple characters, visual features can be applied to the wrong character. To mitigate this, they recommend describing each character explicitly in every prompt.

Project: https://kevin-thu.github.io/StoryMem/

Key Points:  

🌟 The StoryMem system can effectively solve the problem of inconsistent characters and environments in AI video generation.  

📊 By storing key frames, StoryMem improves cross-scene consistency by 28.7% over the unmodified base model.  

🛠️ The system still faces challenges in handling complex scenes and requires clear descriptions of characters to improve generation results.