Google DeepMind has recently released SIMA 2, a research agent aimed at testing how well general agents perform in complex 3D game worlds. SIMA 2 (Scalable Instructable Multiworld Agent) is an upgraded version built on the Gemini model, which lets it better understand goals, explain its plans, and keep improving across different environments through self-directed learning.

SIMA 1, the predecessor of SIMA 2, was launched in 2024. It controlled games by taking rendered images as input and operating a virtual keyboard and mouse. It learned to follow more than 600 language instructions and completed about 31% of tasks, compared with around 71% for human players. SIMA 2 retains the same interface but uses Gemini 2.5 Flash-Lite as its core reasoning engine. This makes SIMA 2 not just an instruction executor but also a gaming partner that interacts with players.

The architecture of SIMA 2 integrates Gemini as a core component: it receives visual observations and user instructions, derives high-level goals from them, and generates the corresponding actions. This design enables the agent to explain its intentions, answer questions about its current goals, and show its reasoning about the environment. In DeepMind's evaluation, SIMA 2's task completion rate rose to 62%, double SIMA 1's 31% and approaching the roughly 71% achieved by human players.
SIMA 2 also expands the range of instruction channels: beyond text, it can handle voice, sketches, and even emojis. In one demonstration, the user asked SIMA 2 to find a "house that is the color of a ripe tomato." The agent reasoned that "a ripe tomato is red" and successfully located the target.
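The observe, reason, act flow described above can be illustrated with a toy sketch. This is not DeepMind's implementation: the `Observation` and `Decision` types, the `COLOR_FACTS` lookup, and the action strings are all hypothetical stand-ins, and a small dictionary takes the place of the Gemini model's world knowledge.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    visible_objects: list[str]  # hypothetical stand-in for a rendered game frame
    instruction: str            # the user's natural-language instruction


@dataclass
class Decision:
    goal: str           # high-level goal derived from the instruction
    explanation: str    # short rationale the agent can report to the user
    actions: list[str]  # hypothetical keyboard/mouse commands


# Hypothetical lookup standing in for the language model's world knowledge.
COLOR_FACTS = {"ripe tomato": "red"}


def toy_reasoner(obs: Observation) -> Decision:
    """Map an observation and instruction to a goal, an explanation, and actions."""
    for phrase, color in COLOR_FACTS.items():
        if phrase in obs.instruction:
            target = f"{color} house"
            if target in obs.visible_objects:
                return Decision(
                    goal=f"reach the {target}",
                    explanation=f"a {phrase} is {color}",
                    actions=["key:W", f"mouse:turn_toward:{target}"],
                )
    # Nothing matched: fall back to exploring the scene.
    return Decision(
        goal="explore",
        explanation="no matching target in view",
        actions=["mouse:look_around"],
    )


if __name__ == "__main__":
    obs = Observation(
        visible_objects=["blue house", "red house"],
        instruction="find the house that is the color of a ripe tomato",
    )
    decision = toy_reasoner(obs)
    print(decision.goal, "because", decision.explanation)
```

The point of the sketch is the separation of concerns: the reasoner turns an indirect description into an explicit goal plus a rationale, and only then emits low-level inputs through the same keyboard-and-mouse interface.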
Self-improvement is another major highlight of SIMA 2. After an initial stage of training on human gameplay demonstrations, the agent enters new games and learns entirely from its own experience. A Gemini model generates new tasks for the agent and scores its attempts, so later versions succeed at many tasks that earlier ones failed, without additional human demonstrations.
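The self-improvement cycle can be sketched as a simple loop: propose a task, attempt it, score the attempt, and use the scored experience to produce the next agent version. The code below is a toy simulation under stated assumptions, not DeepMind's training procedure. The task list, the per-task `skill` probabilities, the binary reward, and the fixed `0.02` skill increment are all invented for illustration, and plain functions stand in for the Gemini task generator and scorer.

```python
import random

# Hypothetical task pool; in the article, a Gemini model proposes the tasks.
TASKS = ["chop a tree", "open the door", "climb the ladder"]


def propose_task(rng: random.Random) -> str:
    """Stand-in for the model-based task generator."""
    return rng.choice(TASKS)


def attempt(task: str, skill: dict, rng: random.Random) -> dict:
    """Simulate one episode: success is a coin flip weighted by current skill."""
    return {"task": task, "succeeded": rng.random() < skill[task]}


def score(trajectory: dict) -> float:
    """Stand-in for the model-based scorer: 1.0 for success, 0.0 otherwise."""
    return 1.0 if trajectory["succeeded"] else 0.0


def train_next_generation(skill: dict, experience: list) -> dict:
    """Toy update: each rewarded trajectory nudges that task's skill upward."""
    new_skill = dict(skill)
    for trajectory, reward in experience:
        if reward > 0:
            task = trajectory["task"]
            new_skill[task] = min(1.0, new_skill[task] + 0.02)
    return new_skill


def self_improve(generations: int = 3, episodes: int = 50, seed: int = 0) -> list:
    """Run the loop and return the skill table after each generation."""
    rng = random.Random(seed)
    skill = {task: 0.2 for task in TASKS}  # assumed starting point after demonstrations
    history = [dict(skill)]
    for _ in range(generations):
        experience = []
        for _ in range(episodes):
            task = propose_task(rng)
            trajectory = attempt(task, skill, rng)
            experience.append((trajectory, score(trajectory)))
        skill = train_next_generation(skill, experience)
        history.append(dict(skill))
    return history


if __name__ == "__main__":
    for generation, skills in enumerate(self_improve()):
        print(generation, skills)
```

No human demonstrations enter the loop after the initial skill table is set, which mirrors the article's claim: the only learning signal in later generations comes from self-generated tasks and automated scoring.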
Finally, DeepMind combined SIMA 2 with Genie 3, which generates interactive 3D environments from a single image or text prompt, and showed that the agent can identify objects and complete specified tasks in these newly generated environments. This marks an important step toward general agents and, eventually, more capable real-world robots.
Official blog: https://deepmind.google/blog/sima-2-an-agent-that-plays-reasons-and-learns-with-you-in-virtual-3d-worlds/
Key points:
🌟 SIMA 2 integrates the Gemini 2.5 Flash-Lite model, giving the agent stronger reasoning and planning capabilities.
📈 SIMA 2's task completion rate has risen to 62%, up from SIMA 1's 31% and approaching the roughly 71% achieved by human players.
🛠️ Through its self-improvement mechanism and Genie 3 environment generation, SIMA 2 demonstrates adaptability and generality in new scenarios.
