The research paper "Causal World Modeling for Robot Control," co-authored by Ant Lingbo Technology and universities including the Hong Kong University of Science and Technology, was recently accepted to Robotics: Science and Systems (RSS) 2026, a top international robotics conference.
RSS is one of the most prestigious academic conferences in robotics, covering cutting-edge areas such as robot learning, control, perception, planning, and systems. Its strict acceptance criteria mean that accepted papers are widely regarded within the global robotics research community as academically innovative.
The core goal of this research is to enable robots not only to perform actions but also to predict how the world will change before they act. The paper proposes a causal world modeling framework for robot control, implemented as LingBot-VA, the world's first open-source autoregressive video-action world model. While the robot performs a task, the model continuously predicts environmental changes and generates next-step action commands based on those predictions, giving the robot a human-like ability to observe, judge, and act at the same time.
For Ant Lingbo, the paper's acceptance to RSS 2026 marks international recognition of its work on "world model-driven robot control" and further validates the technical value of LingBot-VA as a foundation model for embodied intelligence. In the future, this approach could help robots move from relying solely on instructions toward stronger environmental understanding, task generalization, and autonomous decision-making.

For robots, the real challenge is not just performing actions but understanding the changes those actions bring about: for example, what happens on the desk after a cup is picked up, or where objects end up after a drawer is pushed. The core breakthrough of LingBot-VA is to bring the prediction of future changes into robot control, so that the robot first predicts what the world will look like and then decides how to act based on that prediction.
This is why the paper emphasizes "causal world modeling." The physical world unfolds forward in time, so a robot that predicts the future must simulate it step by step in the correct chronological order. LingBot-VA builds this causal relationship into the model structure, so that each prediction depends only on earlier observations and actions. As a result, the model generates not just a video of the future but a causal trajectory that can be used for robot control decisions. This also gives the model stronger long-term memory, which is especially important for completing long-sequence, multi-step real-world tasks.
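The causal constraint described above can be illustrated with a minimal sketch. It is not taken from the LingBot-VA code: the token names and the helpers `causal_mask` and `visible_history` are hypothetical, and the sketch only assumes that observations and actions are laid out as one time-ordered sequence in which each position may see itself and earlier positions, never later ones.

```python
def causal_mask(num_steps):
    """Return a square mask where mask[i][j] is True if position i may use position j.

    Position i may only use positions 0..i, never anything later in time.
    """
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]


def visible_history(tokens, step):
    """Return the tokens that the given position is allowed to condition on."""
    mask = causal_mask(len(tokens))
    return [tok for tok, allowed in zip(tokens, mask[step]) if allowed]


# Hypothetical interleaved sequence of observations and actions.
tokens = ["obs_0", "act_0", "obs_1", "act_1", "obs_2"]

# The position holding obs_1 sees only what came before it, plus itself.
history = visible_history(tokens, 2)
print(history)  # ['obs_0', 'act_0', 'obs_1']
```

The lower-triangular mask is the standard way autoregressive models enforce chronological order; how LingBot-VA arranges its video and action tokens in detail is described in the paper, not here.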
In terms of technical implementation, LingBot-VA adopts a Mixture-of-Transformers (MoT) architecture that unifies video prediction and action generation within a single autoregressive diffusion framework. The model also features a closed-loop inference mechanism: it continuously receives feedback from the real environment during task execution, which reduces the accumulation of errors over long-horizon predictions.
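Why closed-loop feedback limits error accumulation can be shown with a toy example. The sketch below is an illustration under stated assumptions, not the LingBot-VA inference code: the one-dimensional "world", the 10% undershoot, and the functions `predict_next`, `real_step`, `choose_action`, and `rollout` are all hypothetical stand-ins.

```python
def predict_next(state, action):
    """Toy stand-in for a world model: assumes the action is executed exactly."""
    return state + action


def real_step(state, action):
    """Toy stand-in for the real environment: every move undershoots by 10%."""
    return state + 0.9 * action


def choose_action(state, goal):
    """Pick a move toward the goal, limited to one unit per step."""
    return max(-1.0, min(1.0, goal - state))


def rollout(start, goal, steps, closed_loop):
    """Run the controller and return the final real position.

    closed_loop=True: the controller re-observes the real state after each step.
    closed_loop=False: the controller trusts its own predictions throughout.
    """
    real = start
    belief = start
    for _ in range(steps):
        action = choose_action(belief, goal)
        real = real_step(real, action)
        belief = real if closed_loop else predict_next(belief, action)
    return real


open_loop_error = abs(5.0 - rollout(0.0, 5.0, 10, closed_loop=False))
closed_loop_error = abs(5.0 - rollout(0.0, 5.0, 10, closed_loop=True))
print(open_loop_error > closed_loop_error)  # True
```

In this toy setting the open-loop run believes it has arrived and stops about half a unit short of the goal, while the closed-loop run keeps correcting against the observed state and ends within a thousandth of a unit of it.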
The paper systematically evaluates LingBot-VA on simulation benchmarks and real-robot tasks. On the 50 dual-arm manipulation tasks of RoboTwin2.0, LingBot-VA achieves average success rates of 92.0% and 91.1% under the Easy and Hard settings, respectively, and it reaches 98.5% on the LIBERO benchmark.
In real-world evaluations, LingBot-VA was tested on six challenging tasks in three major categories: long-horizon manipulation, high-precision manipulation, and manipulation of deformable and articulated objects. It needs only 50 real-world demonstrations to adapt and achieves success rates more than 20 percentage points above industry baselines, demonstrating strong data efficiency and generalization capability.
The LingBot-VA model weights, along with the training and inference code, were released earlier this year. Researchers and developers can access and download them on Hugging Face and GitHub.
Paper link: https://arxiv.org/abs/2601.21998
Project page: https://technology.robbyant.com/lingbot-va
