Apple has officially released STARFlow-V, a new video generation model whose underlying technology differs fundamentally from mainstream competitors such as Sora, Veo, and Runway. Rather than the diffusion models that dominate the industry, STARFlow-V adopts **normalizing flows**, aiming to address the stability and error-accumulation problems that plague long video generation.

Diffusion models generate videos by iteratively removing noise over many steps, whereas the normalizing flow at the core of STARFlow-V directly learns an invertible mathematical transformation between random noise and complex video data. This fundamental difference brings several advantages (a minimal code sketch follows the list):
- **Training efficiency:** Training is completed in a single pass, without many small iterative steps.
- **Generation speed:** Once trained, the model produces video directly, with no iterative denoising, which speeds up generation significantly.
- **Error reduction:** It avoids the errors that commonly accumulate during step-by-step generation.
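
To make the distinction concrete, here is a minimal sketch of the standard building block behind normalizing flows, a RealNVP-style affine coupling layer in PyTorch. This is a generic illustration of the technique, not Apple's STARFlow-V code; all class and variable names are invented for the example.

```python
import torch
import torch.nn as nn

# Generic RealNVP-style affine coupling layer -- a standard normalizing-flow
# building block, NOT Apple's STARFlow-V implementation. A flow learns an
# invertible map between noise z and data x, so sampling is one forward pass.
class AffineCoupling(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.half = dim // 2
        # A small net predicts a scale and shift for the second half of the
        # input, conditioned on the first half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)          # bound the scales for stability
        y2 = x2 * torch.exp(log_s) + t     # invertible affine transform
        # log|det J| is just the sum of log-scales, so exact maximum-likelihood
        # training needs only this single pass -- no iterative denoising.
        log_det = log_s.sum(dim=-1)
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)   # exact inverse, by construction
        return torch.cat([y1, x2], dim=-1)

# Sampling: draw Gaussian noise once and map it through the flow.
layer = AffineCoupling(dim=8)
z = torch.randn(4, 8)
x, log_det = layer(z)
```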
Apple states that STARFlow-V is the first normalizing-flow model of its kind to match diffusion models in both visual quality and speed. By processing in parallel and reusing computation from previous frames, it generates a five-second video roughly 15 times faster than the initial version.
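
Apple has not detailed the caching scheme, but the general idea of reusing previous-frame computation can be sketched as follows. `FrameCache` and `encode_frame` are hypothetical stand-ins for illustration only:

```python
import torch

# Hypothetical sketch of reusing previous-frame computation: encode each new
# frame once, cache it, and build the generation context from the cache
# instead of re-encoding the whole clip. Names are illustrative, not
# STARFlow-V APIs.
class FrameCache:
    def __init__(self):
        self.features = []  # features of frames processed so far

    def append(self, feat: torch.Tensor):
        self.features.append(feat)

    def context(self) -> torch.Tensor:
        # Reuse cached features rather than recomputing them each step.
        return torch.cat(self.features, dim=1)

def encode_frame(frame: torch.Tensor) -> torch.Tensor:
    # Stand-in for the model's per-frame encoder.
    return frame.mean(dim=(-1, -2)).unsqueeze(1)

cache = FrameCache()
video = torch.randn(1, 10, 3, 64, 64)  # (batch, frames, C, H, W)
for t in range(video.shape[1]):
    cache.append(encode_frame(video[:, t]))
    ctx = cache.context()  # O(1) new encoding work per frame, not O(t)
```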
**Dual Architecture to Tackle Long Video Challenges**
Generating long sequences remains a challenge for current video AI, because frame-by-frame generation lets errors accumulate. STARFlow-V alleviates this with a dual-architecture approach (see the sketch after this list):
- One component manages the temporal sequence across frames (motion consistency).
- The other optimizes details within individual frames (image quality).
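
The following sketch illustrates how such a split might look in code, with a temporal transformer handling cross-frame consistency and a per-frame network refining spatial detail. The module names and sizes are assumptions for illustration, not the published STARFlow-V design:

```python
import torch
import torch.nn as nn

# Illustrative dual-architecture split: a temporal module reasons across
# frames while a per-frame module refines spatial detail. Assumed design
# for this sketch, not Apple's published architecture.
class DualVideoModel(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Component 1: attention over the frame axis -> motion consistency.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=2,
        )
        # Component 2: per-frame refiner -> image quality within each frame.
        self.spatial = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim), one latent vector per frame
        coherent = self.temporal(frames)   # cross-frame dependencies
        refined = self.spatial(coherent)   # per-frame detail
        return refined

model = DualVideoModel()
out = model(torch.randn(2, 30, 64))  # e.g. a 30-frame latent sequence
```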
This design allows STARFlow-V to maintain stability in a 30-second demonstration clip, while competitors like NOVA and Self-Forcing begin to show blur or color distortion after just a few seconds.

**Multi-functionality and Performance**
The model handles a range of tasks without modification (a unified-interface sketch follows the list), including:
- Text-to-video.
- Image-to-video, where the input image serves as the starting frame.
- Video editing, such as adding or removing objects.
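
One way to picture this flexibility: the task changes only what is placed in the conditioning context before autoregressive generation begins. The `generate` and `toy_step` functions below are invented for this sketch and are not STARFlow-V's API:

```python
import torch

# Toy stand-in for the model: predicts the next frame latent from the
# running context. Invented for this sketch; not STARFlow-V's API.
def toy_step(context: torch.Tensor) -> torch.Tensor:
    b, _, d = context.shape
    return context.mean(dim=1, keepdim=True) + 0.1 * torch.randn(b, 1, d)

def generate(text_emb: torch.Tensor, init_frames=None, num_frames=16):
    # Text-to-video: context is just the prompt embedding.
    # Image-to-video / editing: prepend the given frame latents as well.
    context = text_emb if init_frames is None else torch.cat(
        [text_emb, init_frames], dim=1
    )
    frames = []
    for _ in range(num_frames):
        nxt = toy_step(context)
        frames.append(nxt)
        context = torch.cat([context, nxt], dim=1)  # autoregressive rollout
    return torch.cat(frames, dim=1)

prompt = torch.randn(1, 1, 32)                 # toy text embedding
t2v = generate(prompt)                         # text-to-video
i2v = generate(prompt, torch.randn(1, 1, 32))  # image-to-video (start frame)
```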
On the VBench benchmark, STARFlow-V scored 79.7. Although it trails top diffusion models such as Veo3 (85.06) and HunyuanVideo (83.24), it significantly outperforms other autoregressive models, excelling in particular at spatial relationships and human representation.
Despite the technological innovation, STARFlow-V still has limitations: output resolution is relatively low (640×480 at 16 frames per second), and the model cannot yet run in real time on standard GPUs.
More importantly, it shows clear weaknesses in physical simulation, producing artifacts such as "an octopus passing through glass" or "a stone appearing out of nowhere."
Apple acknowledges these limitations and plans to focus future work on faster computation, smaller model size, and training data with greater physical accuracy. The code has been published on GitHub, and the model weights will follow on Hugging Face.
