Video world models are shifting from a single-agent perspective to multi-agent collaboration. Traditional video world models mostly assume a single agent, which makes it hard for them to handle scenarios where multiple players act in, and observe each other within, the same virtual world at the same time. To break through this architectural bottleneck, NVIDIA, together with Tsinghua University, the University of Toronto, and the Vector Institute, has introduced a new multi-agent world model called Gamma-World (γ-World).
The core challenge of multi-agent world modeling lies in maintaining three kinds of consistency: temporal consistency, cross-view consistency, and interaction consistency. Earlier work such as Solaris made progress on two-player collaboration, but it exposed two major defects. Its identity encoding breaks permutation symmetry among agents, and its fully connected attention mechanism makes computational cost grow quadratically with the number of participants, so it cannot scale effectively to more agents.

To address these structural shortcomings, Gamma-World was redesigned from the ground up. First, the team proposed "Simplex Rotary Agent Encoding." It places all players at the vertices of a geometric simplex, so every player is equidistant from every other and none has a privileged position. The encoding contains no learnable parameters and assigns vertices to players at random, which lets a model trained on two-player data run directly on four-player scenes without any change to the architecture.
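The equidistance property can be illustrated with a small sketch. This is not the Gamma-World implementation: it assumes one simple way to build a regular simplex (one-hot vectors shifted to a zero centroid), it leaves out the rotary part of the encoding entirely, and the function names are hypothetical. It only shows that every pair of players ends up the same distance apart whether there are two players or four.

```python
import itertools
import math
import random


def simplex_vertices(n_agents):
    """Return n_agents mutually equidistant points (a regular simplex).

    Assumption: one-hot vectors shifted so that their centroid is the origin.
    """
    shift = 1.0 / n_agents
    return [
        [(1.0 if i == j else 0.0) - shift for j in range(n_agents)]
        for i in range(n_agents)
    ]


def assign_vertices(agent_ids, rng):
    """Randomly pair each agent with a vertex; nothing here is learned."""
    vertices = simplex_vertices(len(agent_ids))
    rng.shuffle(vertices)
    return dict(zip(agent_ids, vertices))


def pairwise_distances(points):
    """Distances between every unordered pair of points."""
    return [math.dist(a, b) for a, b in itertools.combinations(points, 2)]


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (2, 4):
        codes = assign_vertices([f"player_{k}" for k in range(n)], rng)
        distances = pairwise_distances(list(codes.values()))
        # Every pair shares one distance, so the set has a single element.
        print(n, sorted({round(d, 6) for d in distances}))
```

Because the vertices are interchangeable, swapping which player receives which vertex leaves the geometry unchanged, which is the permutation symmetry the encoding is meant to preserve.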
Second, to remove the compute bottleneck, Gamma-World introduced the "Sparse Hub Attention Mechanism." This design abandons direct pairwise communication between agents. Instead, a set of learnable hub tokens acts as a compressed relay for the shared world state, which reduces the computational cost to linear in the number of agents. Combined with independent caching, the system achieves real-time, action-responsive simulation at 24 frames per second.
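To make the quadratic-versus-linear contrast concrete, the sketch below counts how many query-key pairs an attention mask allows. It is a counting exercise under stated assumptions, not the Gamma-World mechanism: it assumes each agent's tokens attend to that agent's own tokens and to the hub tokens, that hub tokens attend to everything, and that agents never attend to each other directly. The `HUB` marker, the token counts, and all function names are hypothetical.

```python
HUB = -1  # hypothetical owner id marking a hub token


def token_owners(n_agents, tokens_per_agent, n_hubs):
    """Label each token with its owning agent, followed by the hub tokens."""
    owners = [a for a in range(n_agents) for _ in range(tokens_per_agent)]
    return owners + [HUB] * n_hubs


def hub_mask(owners):
    """True where query q may attend to key k: same owner, or either is a hub."""
    return [[q == k or HUB in (q, k) for k in owners] for q in owners]


def dense_mask(owners):
    """Fully connected attention: every token attends to every token."""
    return [[True for _ in owners] for _ in owners]


def attended_pairs(mask):
    """Number of query-key pairs the mask allows (a proxy for attention cost)."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for n in (2, 4, 8):
        owners = token_owners(n, tokens_per_agent=3, n_hubs=2)
        dense = attended_pairs(dense_mask(owners))
        sparse = attended_pairs(hub_mask(owners))
        print(f"agents={n} dense_pairs={dense} hub_pairs={sparse}")
```

With the per-agent token count and the number of hubs held fixed, each added agent contributes the same number of allowed pairs under the hub mask, while the dense mask grows with the square of the total token count.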
For training, the project adopted a three-stage teacher-student distillation method in which a bidirectional teacher model guides a causal student model, compressing multi-step sampling into four-step sampling. This preserves action controllability and also mitigates error accumulation during autoregressive inference.
Experimental results show that across five core scenarios in a multi-player Minecraft environment, including memory and construction, Gamma-World outperformed existing state-of-the-art models, with an average reduction of more than 40% in FVD (Fréchet Video Distance), the metric used to evaluate video quality. The framework has also been transferred to real dual-arm robot collaboration tasks, showing that the approach carries over beyond game environments. Beyond improving multi-agent simulation, it could in the future serve as large-scale simulation infrastructure for physical AI fields such as multi-arm medical collaboration, factory multi-robot scheduling, and autonomous driving.
