To address the long-standing problems of "character distortion" and "environmental flickering" in AI video generation, a research team from ByteDance and Nanyang Technological University recently introduced a system called StoryMem. By adding a mechanism loosely modeled on human memory, the system achieves high consistency in long, cross-scene video creation, tackling the visual inconsistency that models such as Sora and Kling often exhibit in multi-shot storytelling.

The core of StoryMem is its "hybrid memory bank" design. The researchers note that generating an entire story in a single pass drives computational cost up sharply, while generating scenes independently loses context. StoryMem therefore selectively stores key frames from previous scenes as references. The selection uses two filters: semantic analysis first picks the visually central frames, then a quality check discards blurry ones. When a new scene is generated, these key frames are fed into the model as conditioning, and Rotary Position Embedding (RoPE) assigns them negative time indices. This cues the model to treat them as "past events," keeping character appearance and background details stable as the story progresses.
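For illustration only, the minimal Python sketch below shows the two ideas described above: dual-filter key-frame selection and negative time indices for memory frames. The function names, the sharpness threshold, and the assumption that semantic relevance scores are already computed are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of two StoryMem ideas:
# (1) dual-filter key-frame selection and (2) negative RoPE time indices
# for memory frames. All names and thresholds here are illustrative.
import numpy as np

def select_memory_frames(frames, semantic_scores, top_k=4, sharpness_thresh=100.0):
    """Keep the most semantically relevant frames, then drop blurry ones."""
    # Filter 1: rank frames by a semantic relevance score (assumed to come
    # from some vision-language scoring step, provided here as input).
    order = np.argsort(semantic_scores)[::-1][:top_k]
    selected = []
    for idx in order:
        # Filter 2: quality check -- variance of a simple Laplacian as a
        # sharpness proxy; low variance suggests a blurry frame.
        gray = frames[idx].mean(axis=-1)
        lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0)
               + np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4 * gray)
        if lap.var() >= sharpness_thresh:
            selected.append(idx)
    return selected

def rope_time_indices(num_memory_frames, num_new_frames):
    """Memory frames get negative time indices so the model treats them
    as 'past events'; the new scene starts at index 0."""
    memory_ids = np.arange(-num_memory_frames, 0)   # e.g. [-4, -3, -2, -1]
    current_ids = np.arange(num_new_frames)         # e.g. [0, 1, ..., T-1]
    return np.concatenate([memory_ids, current_ids])

# Example: 4 stored key frames conditioning a 16-frame new scene.
print(rope_time_indices(4, 16))
```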

Notably, StoryMem's implementation is efficient. It is built as a LoRA adaptation of Alibaba's open-source Wan2.2-I2V model, adding only about 700 million parameters to the roughly 14-billion-parameter base model and significantly lowering the training barrier. On the ST-Bench benchmark, which contains 300 scene descriptions, StoryMem improved cross-scene consistency by 28.7% over the base model and outperformed cutting-edge systems such as HoloCine in aesthetic scores and user preference.
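The LoRA technique itself is standard; the sketch below shows the general idea of adding trainable low-rank matrices to a frozen weight, which is why the extra parameter count stays small relative to the base model. The rank, dimensions, and class name are illustrative and not the values used by StoryMem.

```python
# Generic LoRA sketch: the frozen base weight is left untouched and only two
# small low-rank matrices A and B are trained, so the added parameter count
# is rank * (in + out) instead of in * out. Dimensions are illustrative.
import numpy as np

class LoRALinear:
    def __init__(self, weight, rank=16, alpha=32.0, rng=None):
        rng = rng or np.random.default_rng(0)
        self.weight = weight                  # frozen base weight, shape (out, in)
        out_dim, in_dim = weight.shape
        self.A = rng.normal(0, 0.02, (rank, in_dim))  # trainable
        self.B = np.zeros((out_dim, rank))            # zero init: no change at start
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * (x A^T) B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

# Example: adapt one 4096x4096 projection with a rank-16 update.
layer = LoRALinear(np.zeros((4096, 4096)), rank=16)
y = layer(np.ones((1, 4096)))
```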

In addition, the system is practical in use: users can upload their own photos as "memory starting points" to generate coherent stories around them, and scene transitions become smoother. Although it still has limitations when handling multiple characters at once or large action transitions, the team has released the model weights on Hugging Face and set up a project page for developers to explore.
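As a hypothetical usage sketch only (the released repository's actual API may differ), the "memory starting point" idea can be pictured as placing the uploaded photo into the memory bank before the first scene is generated, so it receives the earliest negative time index:

```python
# Hypothetical sketch, not the released API: a user photo becomes the oldest
# memory frame, so every generated scene treats it as an event from before
# the story begins (see rope_time_indices in the earlier sketch).
import numpy as np

user_photo = np.zeros((480, 832, 3), dtype=np.uint8)  # stand-in for an uploaded image
memory_frames = [user_photo]                           # later extended with scene key frames

# With 1 memory frame and a 16-frame scene, the photo gets index -1 and the
# new scene spans indices 0..15.
```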

Project page: https://kevin-thu.github.io/StoryMem/

Model weights: https://huggingface.co/Kevin-thu/StoryMem