With the rapid development of artificial intelligence and robotics, vision-language-action (VLA) models are widely regarded as key to building general-purpose robots. However, many existing VLA models (such as OpenVLA and RT-2) reveal a serious shortcoming in complex, unstructured environments: spatial blindness. They rely on 2D RGB images as visual input, which limits their reasoning in 3D space and makes it difficult to accurately judge the depth and position of objects.


To address this issue, the research team at Yuanli Lingji has introduced a new VLA framework, GeoVLA. The framework retains the strong pre-trained capabilities of existing vision-language models (VLMs) while adopting an innovative dual-stream architecture. Specifically, GeoVLA introduces a dedicated point cloud embedding network (PEN) and a spatial-aware action expert (3DAE), giving the robot genuine 3D geometric perception. This design not only achieves leading performance in simulation environments but also demonstrates excellent robustness in real-world testing scenarios.

The core idea of GeoVLA is task decoupling: the VLM handles "understanding what it is," while the point cloud network handles "knowing where it is." This end-to-end framework coordinates three key components: a semantic understanding stream, a geometric perception stream, and an action generation stream. Together, they enable the model to execute tasks more accurately.
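To make the dual-stream decoupling concrete, here is a minimal NumPy sketch of the overall data flow. All function names, dimensions, and internals are illustrative assumptions, not GeoVLA's actual implementation: the VLM is mocked as a token-pooling projection, the PEN is approximated by a PointNet-style per-point MLP with max-pooling, and the action expert is a simple fused regression head.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_stream(rgb_tokens, text_tokens, dim=64):
    # Hypothetical stand-in for the VLM: project and pool image + language
    # tokens into one semantic feature ("understanding what it is").
    fused = np.concatenate([rgb_tokens, text_tokens], axis=0)
    w = rng.standard_normal((fused.shape[1], dim)) / np.sqrt(fused.shape[1])
    return np.tanh(fused @ w).mean(axis=0)          # (dim,)

def geometric_stream(point_cloud, dim=64):
    # Hypothetical PEN-style encoder: per-point projection + ReLU, then a
    # permutation-invariant max-pool over points ("knowing where it is").
    w = rng.standard_normal((point_cloud.shape[1], dim)) / np.sqrt(3)
    per_point = np.maximum(point_cloud @ w, 0.0)    # (num_points, dim)
    return per_point.max(axis=0)                    # (dim,)

def action_expert(sem_feat, geo_feat, action_dim=7, horizon=8):
    # Hypothetical 3D-aware action head: concatenate both streams and
    # regress a short chunk of 7-DoF actions (xyz, rotation, gripper).
    fused = np.concatenate([sem_feat, geo_feat])
    w = rng.standard_normal((fused.shape[0], horizon * action_dim)) \
        / np.sqrt(fused.shape[0])
    return (fused @ w).reshape(horizon, action_dim)

rgb_tokens = rng.standard_normal((16, 32))   # mock ViT patch tokens
text_tokens = rng.standard_normal((8, 32))   # mock language tokens
points = rng.standard_normal((512, 3))       # mock depth-derived point cloud

actions = action_expert(semantic_stream(rgb_tokens, text_tokens),
                        geometric_stream(points))
print(actions.shape)  # (8, 7): an 8-step chunk of 7-DoF actions
```

The point worth noting is the separation of concerns: the geometric stream never sees language, and the semantic stream never sees raw 3D coordinates; only the action expert fuses the two, which is the dual-stream design the article describes.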


In a series of experiments, GeoVLA demonstrated clear advantages. On the LIBERO benchmark it achieved a 97.7% success rate, surpassing previous state-of-the-art models. In more demanding physics simulations such as ManiSkill2, GeoVLA also performed strongly, maintaining a high success rate especially when handling complex objects and changes in viewpoint.

More impressively, GeoVLA remains robust in out-of-distribution scenarios, demonstrating strong adaptability to uncertainty and changing conditions. This breakthrough opens up new possibilities for future robot applications and pushes intelligent robotics to a higher level.

Project URL: https://linsun449.github.io/GeoVLA/