On May 29, the Variable Robot team officially released WALL-WM, billed as the world's first embodied-intelligence world model based on "event-level prediction." Instead of mechanically learning actions frame by frame, as traditional embodied large models do, WALL-WM switches the world model's prediction unit entirely to semantic events, a change the team says marks a new stage in robots' ability to understand and perform tasks.

In today's embodied intelligence industry, mainstream vision-language-action (VLA) models generally take the current image and an instruction as input and predict a fixed-length action chunk. This frame-by-frame training approach often leaves robots memorizing low-level physical movements while losing sight of the action's ultimate goal. When the cup or the table changes, robots tend to fail for lack of generalization. Addressing this pain point, the Variable team argues in a related academic paper that text, vision, and action information naturally live on different time scales and manifold geometries in the real world, and that forcing them into alignment within a single shared space can easily damage the geometric priors learned during pre-training.

To meet this challenge, WALL-WM introduces an "event-centered" training and execution mechanism. It breaks complex robot tasks into event units with clear semantic meaning, such as reaching, grasping, and moving. During operation, the model no longer rigidly computes the next image frame. It first simulates forward how the world will change as a result of the next event, then translates that visual change into a motion trajectory for the robotic arm.
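
The difference between fixed-length action chunks and event-level units can be illustrated with a small Python sketch. This is not WALL-WM's code: the per-frame labels, the `Event` class, and both helper functions are hypothetical and exist only to show how the same trajectory is cut under each scheme.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    label: str  # semantic event name, e.g. "reach" (hypothetical labels)
    start: int  # first frame index, inclusive
    end: int    # last frame index, exclusive


def fixed_chunks(num_frames: int, chunk_len: int) -> List[Tuple[int, int]]:
    """Cut a trajectory into fixed-length chunks, ignoring what happens in it."""
    return [(s, min(s + chunk_len, num_frames))
            for s in range(0, num_frames, chunk_len)]


def event_segments(frame_labels: List[str]) -> List[Event]:
    """Group consecutive frames with the same label into variable-length events."""
    events: List[Event] = []
    for i, label in enumerate(frame_labels):
        if events and events[-1].label == label:
            events[-1].end = i + 1  # extend the current event
        else:
            events.append(Event(label, i, i + 1))  # start a new event
    return events


# Toy trajectory: 5 frames of reaching, 2 of grasping, 6 of moving.
labels = ["reach"] * 5 + ["grasp"] * 2 + ["move"] * 6

print(fixed_chunks(len(labels), 4))
for e in event_segments(labels):
    print(e.label, e.start, e.end)
```

In this toy example the fixed chunks cut across the boundary between reaching and grasping, while the event units follow it, which is the intuition behind predicting at the event level.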

To make the new architecture run stably in the real physical world, the Variable Robot team carried out a series of substantial engineering changes. On a single set of base weights, the system supports flexible switching between an "event mode" with variable-length action output and a "unified mode" with real-time closed-loop control. It also couples the video model and the action model in one direction only, which keeps the valuable dynamics priors learned from internet video from being prematurely skewed by action data. For geometric perception on multi-camera setups, the model introduces frustum-mask and tube-mask mechanisms that force it to establish true three-dimensional geometric correspondence across views. To address decision latency, it adopts a new "stepped chain-of-thought decoding" technique that, according to the team, significantly reduces decoding latency while preserving logical interpretability.
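
The article does not say how the one-way coupling is implemented. One common way to obtain this kind of asymmetry in a transformer is a block attention mask in which action tokens can read video tokens but video tokens cannot read action tokens. The Python sketch below illustrates that idea only; the function name, the token layout, and the choice of an attention mask are all assumptions, not details of WALL-WM.

```python
from typing import List


def one_way_mask(n_video: int, n_action: int) -> List[List[bool]]:
    """Build a mask where mask[q][k] is True if query q may attend to key k.

    Assumed token order: [video tokens..., action tokens...].
    Video queries see only video keys; action queries see every key.
    """
    n = n_video + n_action
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_is_video = q < n_video
            k_is_video = k < n_video
            # Allowed unless a video query would look at an action key.
            mask[q][k] = k_is_video or not q_is_video
    return mask


# Two video tokens followed by two action tokens.
for row in one_way_mask(2, 2):
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Under this assumption, nothing computed for the video tokens depends on the action tokens, so the video branch behaves the same whether or not action data is present.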

