Robots Say Goodbye to Frame-by-Frame Learning! World's First Event-Level World Model for Embodied Intelligence Released

On May 29, the Variable Robot team officially released WALL-WM, the world's first embodied intelligence world model based on "event-level prediction." The model breaks from traditional large embodied models, which mechanically learn actions frame by frame over time, and switches the world model's prediction unit entirely to semantic events, marking a new stage in robots' ability to understand and perform tasks.

In today's embodied intelligence industry, mainstream vision-language-action (VLA) models generally take the current image and an instruction as input and predict a fixed-length action chunk. This clumsy frame-by-frame training approach often leaves robots remembering only small physical movements while losing sight of the action's ultimate goal. When the scene changes, for example with a different cup or table, robots tend to fail because they lack generalization capability. Addressing this pain point, the Variable team notes in its related academic paper that text, vision, and action information naturally live on different time scales and manifold geometries in the real world, so forcing them into alignment in a single shared space easily damages the geometric priors learned in pre-training.

To meet this challenge, the WALL-WM world model introduces an "event-centered" training and execution mechanism. It divides complex robot tasks into event segments with clear semantic meaning, such as reaching, grasping, and moving. At run time, the model no longer rigidly computes the next image frame. It first simulates forward how the world will change as a result of the next event, then translates that visual change into a motion trajectory for the robotic arm.
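The difference between fixed-length chunks and event-level segments can be illustrated with a minimal Python sketch. The per-frame log, its event labels, and the helper names `fixed_chunks` and `event_segments` are all hypothetical and are not taken from WALL-WM, whose own segmentation method is not described here. The sketch only shows how fixed-size windows can cut across event boundaries while label-based grouping yields variable-length units.

```python
from itertools import groupby

# Hypothetical per-frame log: (event label, gripper x-position in meters).
frames = [
    ("reach", 0.00), ("reach", 0.05), ("reach", 0.10),
    ("grasp", 0.10), ("grasp", 0.10),
    ("move", 0.15), ("move", 0.20), ("move", 0.25), ("move", 0.30),
]

def fixed_chunks(seq, size):
    """Split a sequence into fixed-length chunks (the last one may be shorter)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def event_segments(seq):
    """Group consecutive frames that share an event label into one segment."""
    return [(label, [x for _, x in group])
            for label, group in groupby(seq, key=lambda frame: frame[0])]

chunks = fixed_chunks(frames, 4)
segments = event_segments(frames)

# Count the fixed-length chunks that mix frames from more than one event.
mixed = sum(1 for chunk in chunks if len({label for label, _ in chunk}) > 1)

print("fixed chunks mixing events:", mixed, "of", len(chunks))
for label, xs in segments:
    print(label, "covers", len(xs), "frames")
```

In this toy log, each event segment has its own length, whereas a window of four frames straddles the reach/grasp and grasp/move boundaries.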

To ensure the new architecture runs stably in the real physical world, the Variable Robot team carried out a series of substantial engineering overhauls. On the same base weights, the system supports flexible switching between an "event mode" with variable-length action output and a "unified mode" with real-time closed-loop control. It also couples the video model and the action model in one direction only, which keeps the valuable dynamics priors learned from internet video from being prematurely biased by action data. For geometric perception on multi-camera devices, the model introduces frustum mask and tubular mask mechanisms that force it to establish true three-dimensional geometric correspondences across views. To address decision latency, it adopts a new "stepped chain-of-thought decoding" technique that significantly reduces decoding latency while preserving logical interpretability.
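How WALL-WM builds and applies its frustum mask is not described above. As a rough geometric illustration only, the following Python sketch tests which 3D points fall inside a pinhole camera's view frustum and which are visible from two cameras at once. The function names, field-of-view values, clipping distances, and camera offset are all assumptions made for the example.

```python
import math

def in_frustum(point, h_fov_deg=90.0, v_fov_deg=60.0, near=0.1, far=5.0):
    """Return True if a camera-frame point (x right, y down, z forward) is in view."""
    x, y, z = point
    if not (near <= z <= far):
        return False  # behind the camera, too close, or too far away
    half_width = z * math.tan(math.radians(h_fov_deg) / 2)
    half_height = z * math.tan(math.radians(v_fov_deg) / 2)
    return abs(x) <= half_width and abs(y) <= half_height

def to_camera_frame(point, camera_position):
    """Express a world point in the frame of a camera that is only translated."""
    return tuple(p - c for p, c in zip(point, camera_position))

# Hypothetical scene: camera A at the origin, camera B shifted 0.5 m to the right.
world_points = [(0.0, 0.0, 1.0), (1.2, 0.0, 1.0), (0.0, 0.0, -1.0), (0.5, 0.0, 2.0)]
cam_b_position = (0.5, 0.0, 0.0)

mask_a = [in_frustum(p) for p in world_points]
mask_b = [in_frustum(to_camera_frame(p, cam_b_position)) for p in world_points]

# Points seen by both cameras are candidates for cross-view correspondence.
shared = [a and b for a, b in zip(mask_a, mask_b)]
print(mask_a, mask_b, shared)
```

A real multi-camera setup would also need each camera's rotation and calibrated intrinsics. The translation-only transform here is a simplification to keep the sketch short.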

AutoNavi Launches Full-Stack Upgrade of the ABot Embodied System, Unveils Five Core Models at Once

On July 23, Alibaba's Gaode (Amap) announced a full-stack upgrade of its ABot embodied system, releasing five core models including ABot-N1. The system integrates perception, decision-making, execution, and memory, improving robots' ability to carry out general tasks autonomously. It forms the world's first full-stack system that fuses a world model, a foundation model, and an embodied agent, marking a new stage of collaborative integration in embodied intelligence.

Wang Xingxing of HuaZhu at the World Internet Conference: The 'ChatGPT Moment' for Humanoid Robots Could Arrive in as Few as Two or Three Years

Humanoid robots have gone from stumbling to dancing and fighting in just a few years, and the tipping point between being able to move and being able to work is approaching rapidly. On July 22, at the Digital Silk Road Development Forum of the 2026 World Internet Conference, held in Xi'an, Wang Xingxing, CEO of HuaZhu Technology, proposed an aggressive timetable: the 'ChatGPT Moment' of embodied intelligence will arrive in as little as two or three years, at which point robots will genuinely be able to do work and clear the high barrier to practical application.

Mingxi Intelligent Open-Sources MiniCPM-Robot: Full Release of a 1.5B VLA Model, Individual Developers Can Now Run Manipulation on Real Robots

In the past, embodied intelligence was dominated by tech giants and research laboratories because of large model sizes and high barriers to entry. On July 19, Mingxi Intelligent open-sourced its first series of embodied AI models, MiniCPM-Robot, fully releasing a 1.5B-parameter vision-language-action model that lets individual developers run manipulation and target tracking on real robots. Such openness is rare in the embodied field. The series includes three core components.

Ant Group Open-Sources LingBot-Video: The World's First Video Foundation Model for Embodied Intelligence!

Ant Group has open-sourced LingBot-Video, the world's first video foundation model for embodied intelligence. The model restructures pre-training around the needs of robots, systematically improving inference efficiency, physical plausibility, action understanding, and task completion. It provides an open-source foundation for embodied intelligence and has been validated on the RBench benchmark jointly released by Peking University and ByteDance.

New Breakthrough in Embodied Intelligence: Ant Group Open-Sources LingBot-Vision, Giving Robots a Sense of Space

Ant Group's Robbyant has open-sourced the LingBot-Vision model family, which achieves outstanding performance on dense spatial perception tasks through self-supervised vision Transformers and a new approach to boundary modeling. It surpasses models with several times as many parameters on multiple metrics. This moves beyond existing visual foundation models, which focus heavily on object recognition, and makes precise perception of physical space possible for robots.