Apple recently launched its latest video generation model, STARFlow-V, which takes a markedly different technical approach from market competitors such as Sora, Veo, and Runway. STARFlow-V is designed to improve the stability of long video generation, using "normalizing flow" technology in place of the currently dominant diffusion models.

Apple states that STARFlow-V is the first model of its kind capable of matching diffusion models in visual quality and generation speed, though its output is limited to 640×480 pixels at 16 frames per second. Unlike diffusion models, which gradually remove noise over many iterative steps, STARFlow-V learns a direct, invertible mathematical transformation between random noise and video data, allowing it to generate a video in a single pass. This greatly improves efficiency and avoids the errors that can accumulate during step-by-step generation.
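The core idea behind single-pass generation can be illustrated with a minimal one-dimensional affine normalizing flow. This is a toy sketch, not Apple's architecture: the flow, its training loop, and the scalar "data" are all illustrative stand-ins. The key property shown is that the model is invertible, so it trains by exact maximum likelihood and samples by applying the learned map to noise just once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data" the flow should model (a stand-in for video latents).
data = rng.normal(loc=3.0, scale=0.5, size=2048)

# Minimal affine flow: x = mu + exp(log_sigma) * z, with z ~ N(0, 1).
# Because the map is invertible, the log-likelihood is exact and we can
# train by maximum likelihood -- no iterative denoising involved.
mu, log_sigma = 0.0, 0.0
lr = 0.05
for _ in range(500):
    z = (data - mu) * np.exp(-log_sigma)          # inverse transform
    # Gradients of the mean negative log-likelihood.
    grad_mu = -np.mean(z) * np.exp(-log_sigma)
    grad_log_sigma = 1.0 - np.mean(z**2)
    mu -= lr * grad_mu
    log_sigma -= lr * grad_log_sigma

# Sampling is a single forward pass: draw noise, apply the learned map once.
samples = mu + np.exp(log_sigma) * rng.normal(size=2048)
print(mu, np.exp(log_sigma))
```

The fitted parameters recover the data's mean and scale; contrast this with a diffusion sampler, which would need dozens of network evaluations per sample.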
The system handles a range of tasks, including standard text-to-video, image-to-video (using the input image as the first frame), and video editing. For videos longer than the training length, STARFlow-V uses a sliding-window technique: after generating a segment, it retains the last few frames as context and continues generating from them. However, the temporal variation in the demonstration clips suggests limited diversity.
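The sliding-window scheme can be sketched as follows. This is a hypothetical illustration: `generate_segment` is a placeholder for the real model, the segment and context lengths are made up, and each "frame" is just a scalar. What it shows is the mechanism of keeping only the trailing frames as context for the next segment.

```python
import numpy as np

def generate_segment(context, length, rng):
    """Placeholder for the video model: produce `length` new frames
    conditioned on the trailing context. Each 'frame' is a scalar that
    continues the context's trend, purely for illustration."""
    frames = []
    prev = context[-1]
    for _ in range(length):
        prev = prev + 1.0 + 0.01 * rng.standard_normal()
        frames.append(prev)
    return frames

def sliding_window_generate(total_frames, segment_len=16, context_len=4, seed=0):
    rng = np.random.default_rng(seed)
    video = [0.0] * context_len                # initial frames (e.g., from a prompt)
    while len(video) < total_frames:
        context = video[-context_len:]         # keep only the last few frames
        video.extend(generate_segment(context, segment_len, rng))
    return video[:total_frames]

video = sliding_window_generate(64)
print(len(video))  # 64
```

Because each segment sees only a short context window, drift outside that window is invisible to the model, which is one reason long-range diversity can suffer.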
Errors also tend to accumulate during frame-by-frame generation of long sequences. To address this, STARFlow-V adopts a dual architecture: one component manages temporal dependencies across frames, while the other refines the details of individual frames. To stabilize optimization, Apple injects a small amount of noise during training, which can leave the output slightly grainy; a parallel "causal denoising network" then removes this residual noise while preserving motion consistency.
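The interplay between training-time noise and causal denoising can be illustrated with a toy example. This is a sketch under loose assumptions: the "video" is a scalar trajectory, the added noise imitates the grain from training-time noise injection, and `causal_denoise` is a simple exponential average standing in for the actual denoising network. The only property carried over is causality: each frame is cleaned using only itself and earlier frames, so temporal order is preserved.

```python
import numpy as np

rng = np.random.default_rng(1)

# A clean toy "video": one scalar per frame following a smooth trajectory.
t = np.linspace(0.0, 2.0 * np.pi, 64)
clean = np.sin(t)

# Training-time noise injection leaves residual grain in the output.
noise_std = 0.3
generated = clean + noise_std * rng.standard_normal(clean.shape)

def causal_denoise(frames, alpha=0.5):
    """Toy stand-in for a causal denoiser: each frame is smoothed using
    only itself and previously emitted frames."""
    out = np.empty_like(frames)
    acc = frames[0]
    for i, f in enumerate(frames):
        acc = alpha * f + (1.0 - alpha) * acc   # average over the past only
        out[i] = acc
    return out

denoised = causal_denoise(generated)
print(np.abs(generated - clean).mean(), np.abs(denoised - clean).mean())
```

Even this crude causal filter reduces the grain; the trade-off is a slight temporal lag, which a learned denoiser would be trained to avoid.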
For training, Apple combined 70 million text-video pairs with 4 million text-image pairs, using language models to expand each video description into nine different variants. Over several weeks of training, the model was scaled from 3 billion to 7 billion parameters, with resolution and video length increased progressively.
STARFlow-V scored 79.7 on the VBench benchmark, slightly below some leading diffusion models, but its performance stands out among autoregressive models, with notable strengths in spatial relationships and human action. Going forward, Apple plans to focus on improving inference speed, optimizing the model, and emphasizing training data with physical accuracy.
Key Points:
🌟 STARFlow-V uses normalizing flow technology to enhance the stability and efficiency of long video segment generation.
⚙️ The model supports various video generation and editing tasks, demonstrating strong flexibility.
🚀 Apple plans to optimize computing speed and physical accuracy in the future, continuously advancing video generation technology.
