Zhiyuan Robotics has announced Genie Envisioner (GE), a unified world model platform for real-world robot control. The platform departs from the traditional staged development of robot learning systems: it integrates future frame prediction, policy learning, and simulation-based evaluation into a single closed-loop architecture centered on video generation, covering the path from "seeing" to "thinking" to "acting" end to end. Built on approximately 3,000 hours of real robot manipulation video, GE shows clear advantages in cross-platform generalization and long-horizon task execution, and it offers a new technical path for embodied intelligence from visual understanding to action execution.

GE's core contribution is a vision-centric modeling paradigm built on world models. Unlike mainstream vision-language-action (VLA) methods, GE models the dynamic interaction between the robot and its environment directly in visual space, preserving the spatial structure and temporal evolution of the manipulation process. This approach gives GE efficient cross-embodiment generalization, so it can transfer to a new robot platform with very little data, and it also helps with the precise execution of long-horizon tasks. For example, on tasks with a very large number of steps, such as folding boxes, the success rate of GE-Act (the platform's action module) is reported to far exceed that of existing leading methods.


The GE platform consists of three tightly integrated components: GE-Base, GE-Act, and GE-Sim. GE-Base is the foundation of the platform. It uses an autoregressive video generation framework with multi-view generation and a sparse memory mechanism, so it can handle manipulation scenes observed from several camera views, and it strengthens long-horizon reasoning by randomly sampling historical frames. GE-Act is a plug-and-play action module that converts visual latent representations into executable robot control commands through a lightweight architecture, and it uses asynchronous inference for efficient real-time control. GE-Sim extends the generative capability of GE-Base into an action-conditioned neural simulator. It achieves accurate visual prediction through a hierarchical action-conditioning mechanism, supports closed-loop policy evaluation, and can serve as a data engine that generates diverse training data.
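The sparse memory idea described for GE-Base, conditioning on a few randomly sampled historical frames rather than the full history, can be illustrated with a minimal sketch. This is not GE's implementation: the function names `sample_sparse_memory` and `build_context`, the uniform sampling, the fixed memory size, and the choice to keep sampled frames in temporal order are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_sparse_memory(
    history: Sequence[Frame], num_memory: int, rng: random.Random
) -> List[Frame]:
    """Pick up to `num_memory` past frames uniformly at random.

    Hypothetical helper: the sampled frames are returned in their
    original temporal order so a model would see them chronologically.
    """
    if num_memory <= 0 or not history:
        return []
    k = min(num_memory, len(history))
    # Sample distinct indices, then sort to restore temporal order.
    indices = sorted(rng.sample(range(len(history)), k))
    return [history[i] for i in indices]


def build_context(
    history: Sequence[Frame], current: Frame, num_memory: int, rng: random.Random
) -> List[Frame]:
    """Conditioning context = sparse memory frames followed by the current frame."""
    return sample_sparse_memory(history, num_memory, rng) + [current]


if __name__ == "__main__":
    rng = random.Random(0)
    # Integers stand in for image frames; frame 100 is the current observation.
    past_frames = list(range(100))
    context = build_context(past_frames, current=100, num_memory=4, rng=rng)
    print(context)
```

With this kind of sampling the context length stays constant however long the episode runs, which is one plausible reason a sparse memory would help on long-horizon tasks.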

In addition, the Zhiyuan Robotics team has developed the EWMBench evaluation suite to assess the quality of world models for embodied tasks. In comparisons with several advanced models, GE-Base achieved the best results on multiple key metrics, and its scores were highly consistent with human judgment. Zhiyuan Robotics plans to open-source all of GE's code, pre-trained models, and evaluation tools, with the aim of moving robots from passive execution toward an active "imagine-validate-act" loop. In the future, GE is expected to expand to more sensor modalities, support whole-body motion and human-robot collaboration, and continue to advance practical applications in intelligent manufacturing and service robotics.

🔹 Project page

https://genie-envisioner.github.io/ 

🔹 arXiv

https://arxiv.org/abs/2508.05635 

🔹 GitHub

https://github.com/AgibotTech/Genie-Envisioner