Meta's recently released V-JEPA2 model marks a significant step forward for artificial intelligence, particularly in robotics. While large language models (LLMs) excel at processing text, they still lack the physical "common sense" needed to operate in dynamic real-world environments, which limits their use in fields such as manufacturing and logistics. V-JEPA2 offers a new approach to closing this gap.
V-JEPA2 builds a "world model" by learning from video and physical interaction, enabling AI systems to predict and plan in changing environments and laying the groundwork for smarter robots and more advanced automation. Unlike traditional models, V-JEPA2 adopts a video joint-embedding predictive architecture: its core tasks are understanding the objects in a scene, predicting how actions will change it, and planning action sequences that achieve a specific goal.
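To make the joint-embedding idea concrete, here is a minimal PyTorch sketch. It is not Meta's implementation (V-JEPA2 uses a much larger video encoder and predictor); the flattened frame features and all dimensions here are placeholder assumptions. The key point is that the prediction loss is computed between embeddings, not pixels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JEPASketch(nn.Module):
    """Toy joint-embedding predictive model: given context frames, predict
    the *embedding* of future frames instead of reconstructing their pixels."""

    def __init__(self, frame_dim=1024, embed_dim=256):  # placeholder sizes
        super().__init__()
        # stand-in for V-JEPA2's video encoder
        self.encoder = nn.Sequential(nn.Linear(frame_dim, embed_dim), nn.GELU())
        # stand-in for the predictor network
        self.predictor = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, context_frames, future_frames):
        z_pred = self.predictor(self.encoder(context_frames))
        with torch.no_grad():                # targets are not back-propagated through
            z_target = self.encoder(future_frames)
        return F.mse_loss(z_pred, z_target)  # the loss lives in embedding space

# usage with random stand-in features for a batch of 8 clips
model = JEPASketch()
loss = model(torch.randn(8, 1024), torch.randn(8, 1024))
loss.backward()
```

Predicting in embedding space lets the model ignore unpredictable pixel-level detail and focus on what matters for anticipating the scene.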
The model is trained in two stages. In the first stage, V-JEPA2 learns physical regularities in a self-supervised way by watching more than one million hours of unlabeled video, building up foundational knowledge. In the second stage, it is fine-tuned on 62 hours of robot operation video paired with the corresponding control commands, so that the model learns to associate specific actions with their physical outcomes. Thanks to this two-stage training, V-JEPA2 achieves "zero-shot" robot planning: it can manipulate unfamiliar objects in entirely new environments without additional training.
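The two stages can be sketched as follows. This is a schematic under stated assumptions (random tensors stand in for video and robot data, single linear layers stand in for the real networks, and freezing the encoder in stage two is an assumption), not Meta's training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

frame_dim, embed_dim, action_dim = 1024, 256, 7      # placeholder sizes

encoder = nn.Linear(frame_dim, embed_dim)            # stand-in video encoder
predictor = nn.Linear(embed_dim, embed_dim)          # stage-1 latent predictor
ac_predictor = nn.Linear(embed_dim + action_dim, embed_dim)  # stage-2, action-conditioned

# --- Stage 1: self-supervised pretraining on unlabeled video ---
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()))
for _ in range(100):                                 # stands in for ~1M hours of video
    context, future = torch.randn(8, frame_dim), torch.randn(8, frame_dim)
    loss = F.mse_loss(predictor(encoder(context)), encoder(future).detach())
    opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: action-conditioned fine-tuning on robot data ---
opt = torch.optim.Adam(ac_predictor.parameters())    # encoder frozen here (an assumption)
for _ in range(100):                                 # stands in for the 62 hours of robot video
    obs, next_obs = torch.randn(8, frame_dim), torch.randn(8, frame_dim)
    action = torch.randn(8, action_dim)              # control command paired with obs
    z_next_pred = ac_predictor(torch.cat([encoder(obs).detach(), action], dim=-1))
    loss = F.mse_loss(z_next_pred, encoder(next_obs).detach())
    opt.zero_grad(); loss.backward(); opt.step()
```

Stage one gives the model general physical intuition; stage two adds the small amount of action-labeled data needed to connect robot commands to their consequences.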
Concretely, when a robot is given a target image, it uses V-JEPA2 to run internal simulations: it evaluates a set of candidate next actions, selects the best one to execute, and repeats until the task is complete. This method achieves a success rate of 65% to 80% when handling unfamiliar objects.
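A rough sketch of that planning loop, again with placeholder modules: the article does not specify the search procedure, so simple random shooting over candidate action sequences stands in for whatever optimizer V-JEPA2 actually uses:

```python
import torch
import torch.nn as nn

frame_dim, embed_dim, action_dim = 1024, 256, 7      # placeholder sizes
encoder = nn.Linear(frame_dim, embed_dim)            # stand-in frozen encoder
ac_predictor = nn.Linear(embed_dim + action_dim, embed_dim)  # stand-in dynamics model

def plan_next_action(current_obs, goal_image, num_candidates=256, horizon=5):
    """Roll out candidate action sequences in latent space and return the
    first action of the sequence whose imagined end state is closest to the goal."""
    with torch.no_grad():
        z_goal = encoder(goal_image)                         # target latent state
        actions = torch.randn(num_candidates, horizon, action_dim)  # candidates
        z = encoder(current_obs).expand(num_candidates, -1)  # one rollout per candidate
        for t in range(horizon):                             # imagined rollout only
            z = ac_predictor(torch.cat([z, actions[:, t]], dim=-1))
        best = (z - z_goal).norm(dim=-1).argmin()            # closest to goal in latent space
    return actions[best, 0]                                  # act, then observe and replan

# usage: one observation and one goal image, both as stand-in feature vectors
next_action = plan_next_action(torch.randn(1, frame_dim), torch.randn(1, frame_dim))
```

Executing only the first action and then replanning from the new observation keeps the robot robust to prediction errors accumulating over the imagined rollout.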
V-JEPA2 has broad application prospects, especially in logistics and manufacturing. It lets robots adapt quickly to new products and changing warehouse layouts without extensive reprogramming, which matters to companies exploring humanoid robots on factory floors and assembly lines. V-JEPA2 can also power highly realistic digital twins, helping companies simulate new processes or train other AI systems in physically accurate virtual environments.
By releasing the V-JEPA2 model and its training code, Meta hopes the community will help advance its long-term goal: AI systems that, like humans, can understand the world, plan, and carry out unfamiliar tasks.
Project: https://ai.meta.com/vjepa/
Key points:
🔍 The V-JEPA2 model builds a "world model" by observing videos and physical interactions, enhancing a robot's operational capabilities in dynamic environments.
🤖 The model supports "zero-shot" robot planning, enabling robots to manipulate unfamiliar objects in new environments without additional training.
📈 V-JEPA2 has broad application prospects, improving the adaptability of robots in logistics and manufacturing while reducing the need for reprogramming.