Ant Group has officially released a vision-language-action (VLA) foundation model named LingBot-VLA. The model targets complex real-world manipulation and, through training on massive data, achieves general-purpose manipulation across different types of robots, marking another important advance in embodied intelligence.
To build the model, the research team collected about 20,000 hours of real-world teleoperation data on nine mainstream dual-arm robots, including AgiBot G1 and AgileX. The data cover rich action sequences and are automatically annotated with detailed language instructions by Qwen3-VL, forming a high-quality pre-training dataset.
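The article does not spell out the annotation pipeline, but conceptually the instruction labeling amounts to prompting a vision-language model with key frames from each teleoperation episode. Below is a minimal sketch using an OpenAI-compatible endpoint; the endpoint URL, the deployment name `qwen3-vl`, and the prompt wording are illustrative assumptions, not details from the release.

```python
# Hypothetical sketch: auto-generating language instructions for teleoperation
# episodes by prompting a VLM served behind an OpenAI-compatible API (e.g. vLLM).
# Endpoint, model name, and prompt are assumptions for illustration only.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

def encode_frame(path: str) -> str:
    """Read an image file and return a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def label_episode(frame_paths: list[str]) -> str:
    """Ask the VLM to describe the manipulation shown by a few key frames."""
    content = [{"type": "image_url", "image_url": {"url": encode_frame(p)}}
               for p in frame_paths]
    content.append({"type": "text",
                    "text": "Describe the dual-arm manipulation task shown in these "
                            "frames as a single imperative instruction."})
    resp = client.chat.completions.create(
        model="qwen3-vl",  # assumed deployment name
        messages=[{"role": "user", "content": content}],
        max_tokens=64,
    )
    return resp.choices[0].message.content.strip()

# Example: label one episode from three sampled key frames.
# print(label_episode(["ep0_t0.jpg", "ep0_t50.jpg", "ep0_t99.jpg"]))
```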

LingBot-VLA adopts an innovative "hybrid Transformer" architecture. It uses Qwen2.5-VL as a multimodal backbone that processes multi-view images and natural-language instructions simultaneously. A built-in "action expert" branch incorporates the robot's own state in real time and outputs smooth, continuous control trajectories via conditional flow matching, ensuring accurate dual-arm coordination.
In addition, to address the weakness of traditional models in depth perception, Ant Group introduced the LingBot-Depth spatial perception model. Through feature distillation, LingBot-VLA retains strong 3D spatial reasoning even when depth sensor data is missing, performing particularly well on precision tasks such as stacking, inserting, and folding.
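The exact action-expert design is not reproduced here, but the conditional flow matching objective mentioned above can be sketched as follows: the action head learns the velocity field that transports Gaussian noise to the ground-truth action chunk, conditioned on the fused vision-language features and the robot state. Dimensions, the MLP structure, and variable names below are illustrative assumptions; only the flow-matching objective itself follows the standard formulation.

```python
# Minimal PyTorch sketch of a conditional flow-matching action head.
# Dimensions and architecture are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    def __init__(self, ctx_dim=2048, state_dim=32, action_dim=14, horizon=16):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        in_dim = ctx_dim + state_dim + horizon * action_dim + 1  # + interpolation time t
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.SiLU(),
            nn.Linear(1024, 1024), nn.SiLU(),
            nn.Linear(1024, horizon * action_dim),
        )

    def forward(self, ctx, state, noisy_actions, t):
        """Predict the velocity field at interpolation time t."""
        x = torch.cat([ctx, state, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

def flow_matching_loss(head, ctx, state, actions):
    """Standard conditional flow-matching loss on an action chunk."""
    noise = torch.randn_like(actions)                          # x_0 ~ N(0, I)
    t = torch.rand(actions.size(0), 1, device=actions.device)
    x_t = (1 - t)[..., None] * noise + t[..., None] * actions  # linear interpolation path
    target_v = actions - noise                                 # constant velocity along the path
    pred_v = head(ctx, state, x_t, t)
    return ((pred_v - target_v) ** 2).mean()

# Example with random tensors (batch of 4, 16-step chunks of 14-DoF actions).
head = FlowMatchingActionHead()
loss = flow_matching_loss(head,
                          ctx=torch.randn(4, 2048),
                          state=torch.randn(4, 32),
                          actions=torch.randn(4, 16, 14))
loss.backward()
```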
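How LingBot-Depth is distilled into the policy is not detailed in this summary. One common realization of feature distillation is to regress the policy's visual features onto a frozen depth teacher's features during training, so no depth sensor is needed at inference. The sketch below follows that assumption; the projection head, cosine loss, and token shapes are illustrative choices rather than the published recipe.

```python
# Hypothetical sketch of feature distillation from a frozen depth-perception
# teacher into the VLA's visual features. Loss choice and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillationHead(nn.Module):
    """Projects student (VLA backbone) visual tokens into the teacher's feature space."""
    def __init__(self, student_dim=2048, teacher_dim=1024):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_tokens):
        return self.proj(student_tokens)

def distillation_loss(student_tokens, teacher_tokens, head):
    """Cosine-similarity distillation between projected student tokens and
    frozen depth-teacher tokens (teacher gradients are detached)."""
    pred = F.normalize(head(student_tokens), dim=-1)
    target = F.normalize(teacher_tokens.detach(), dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()

# Example: 4 images, 256 visual tokens each.
head = DistillationHead()
student = torch.randn(4, 256, 2048, requires_grad=True)  # from the VLA backbone
teacher = torch.randn(4, 256, 1024)                       # from a frozen depth teacher
total_loss = distillation_loss(student, teacher, head)    # added to the action loss
total_loss.backward()
```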

On the GM-100 real-world benchmark of 100 challenging tasks, the depth-enhanced version of LingBot-VLA achieved a 17.30% success rate, significantly outperforming comparable models such as π0.5 and GR00T N1.6. The research also found the model to be highly data-efficient, requiring only about 80 demonstrations for a specific task to adapt quickly to a new robot.
Ant Group has now open-sourced LingBot-VLA's complete training toolkit and model weights. The toolkit is optimized for large-scale GPU clusters, delivering 1.5 to 2.8 times the training throughput of existing mainstream frameworks. This move will greatly lower the barrier to developing large robot models and help embodied-intelligence technology reach more practical application scenarios.
Paper: https://arxiv.org/pdf/2601.18692
