With the rapid development of artificial intelligence and robotics, vision-language-action (VLA) models are widely regarded as key to building general-purpose robots. However, many existing VLA models (such as OpenVLA and RT-2) reveal a serious shortcoming in complex, unstructured environments: spatial blindness. They rely on 2D RGB images as visual input, which limits their reasoning in 3D space and makes it difficult to accurately judge the depth and position of objects.


To address this issue, the research team at Yuanli Lingji has introduced a new VLA framework, GeoVLA. The framework retains the strong pre-trained capabilities of existing vision-language models (VLMs) while adopting an innovative dual-stream architecture. Specifically, GeoVLA introduces a dedicated point cloud embedding network (PEN) and a spatial-aware action expert (3DAE), giving the robot genuine 3D geometric perception. This design not only achieves leading performance in simulation environments but also demonstrates excellent robustness in real-world testing scenarios.

The core idea of GeoVLA is task decoupling: the VLM handles "understanding what it is," while the point cloud network handles "knowing where it is." This end-to-end framework coordinates three key components: a semantic understanding stream, a geometric perception stream, and an action generation stream. Together, they enable the model to execute tasks more accurately.
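To make the dual-stream decoupling concrete, here is a minimal NumPy sketch of the overall data flow. All function names, dimensions, and internals are illustrative assumptions, not GeoVLA's actual implementation: the VLM is mocked as a token-pooling projection, the PEN is approximated by a PointNet-style per-point MLP with max-pooling, and the action expert is a simple fused regression head.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_stream(rgb_tokens, text_tokens, dim=64):
    # Hypothetical stand-in for the VLM: project and pool image + language
    # tokens into one semantic feature ("understanding what it is").
    fused = np.concatenate([rgb_tokens, text_tokens], axis=0)
    w = rng.standard_normal((fused.shape[1], dim)) / np.sqrt(fused.shape[1])
    return np.tanh(fused @ w).mean(axis=0)          # (dim,)

def geometric_stream(point_cloud, dim=64):
    # Hypothetical PEN-style encoder: per-point projection + ReLU, then a
    # permutation-invariant max-pool over points ("knowing where it is").
    w = rng.standard_normal((point_cloud.shape[1], dim)) / np.sqrt(3)
    per_point = np.maximum(point_cloud @ w, 0.0)    # (num_points, dim)
    return per_point.max(axis=0)                    # (dim,)

def action_expert(sem_feat, geo_feat, action_dim=7, horizon=8):
    # Hypothetical 3D-aware action head: concatenate both streams and
    # regress a short chunk of 7-DoF actions (xyz, rotation, gripper).
    fused = np.concatenate([sem_feat, geo_feat])
    w = rng.standard_normal((fused.shape[0], horizon * action_dim)) \
        / np.sqrt(fused.shape[0])
    return (fused @ w).reshape(horizon, action_dim)

rgb_tokens = rng.standard_normal((16, 32))   # mock ViT patch tokens
text_tokens = rng.standard_normal((8, 32))   # mock language tokens
points = rng.standard_normal((512, 3))       # mock depth-derived point cloud

actions = action_expert(semantic_stream(rgb_tokens, text_tokens),
                        geometric_stream(points))
print(actions.shape)  # (8, 7): an 8-step chunk of 7-DoF actions
```

The point worth noting is the separation of concerns: the geometric stream never sees language, and the semantic stream never sees raw 3D coordinates; only the action expert fuses the two, which is the dual-stream design the article describes.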


In a series of experiments, GeoVLA demonstrated clear advantages. On the LIBERO benchmark it achieved a 97.7% success rate, surpassing previous state-of-the-art models. In more demanding physics simulations such as ManiSkill2, GeoVLA also performed strongly, maintaining a high success rate especially when handling complex objects and changes in viewpoint.

More impressively, GeoVLA remains robust in out-of-distribution scenarios, demonstrating strong adaptability to uncertainty and changing conditions. This breakthrough opens up new possibilities for future robot applications and pushes intelligent robotics to a higher level.

Project URL: https://linsun449.github.io/GeoVLA/