Amid intensifying global competition in artificial intelligence, a team from Shanghai Jiao Tong University and DP Technology has scored 32.1 on Humanity's Last Exam (HLE), the first result above 30 points. The benchmark is known for its extreme difficulty: when it was first released, no model scored above 10 points, and until recently the highest score was 26.9, shared by Kimi-Researcher and Gemini Deep Research.
The research introduces X-Master, a tool-augmented reasoning agent, and X-Masters, a multi-agent workflow system. Both have been open-sourced to encourage further collaboration and development in the AI field.
The core idea of X-Master is to mimic the way human researchers solve problems, switching between internal reasoning and external tools as needed. When it meets a problem it cannot solve through reasoning alone, X-Master expresses its plan of action as code, executes that code with tools such as NumPy and SciPy, and feeds the results back into its working context. This creates a feedback loop in which the agent refines its reasoning with each round of execution.
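The loop described above can be illustrated with a minimal sketch. This is not the released X-Master code: the names `run_code`, `toy_model`, and `solve` are hypothetical, and the reasoning model is replaced by a hard-coded stand-in so that the example runs on its own.

```python
import contextlib
import io


def run_code(source: str) -> str:
    """Execute a code string and return whatever it printed."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # fresh namespace; this is NOT a real sandbox
    except Exception as exc:
        return f"error: {exc!r}"
    return buffer.getvalue().strip()


def toy_model(context: list[str]) -> dict:
    """Hypothetical stand-in for the reasoning model.

    It asks for code execution once, then answers with the observed result.
    """
    observations = [c for c in context if c.startswith("observation:")]
    if not observations:
        code = "print(sum(i * i for i in range(1, 11)))"
        return {"type": "code", "content": code}
    answer = observations[-1].removeprefix("observation:").strip()
    return {"type": "answer", "content": answer}


def solve(question: str, model=toy_model, max_steps: int = 5) -> str:
    """Alternate between reasoning and tool use until an answer appears."""
    context = [f"question: {question}"]
    for _ in range(max_steps):
        step = model(context)
        if step["type"] == "answer":
            return step["content"]
        # The model emitted code: run it and feed the result back in.
        result = run_code(step["content"])
        context.append(f"observation: {result}")
    return "no answer within step budget"


print(solve("What is the sum of the squares from 1 to 10?"))  # prints 385
```

In a real system the stand-in would be a language model call and the executor would be an isolated sandbox; the point here is only the shape of the loop, where each execution result becomes part of the context for the next reasoning step.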
X-Masters goes a step further, using a scattered-and-stacked agent workflow to increase both the breadth and the depth of reasoning. In the scattered phase, multiple solvers work in parallel to generate different solutions, while a critic agent evaluates and improves them. In the stacked phase, a rewriter agent consolidates all of the outputs into a better solution, and a selector agent then chooses the best answer.
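The four roles can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the open-sourced implementation: every function below is a hypothetical stub, majority voting stands in for model-based rewriting and selection, and the choice to let the selector see both the reviewed drafts and the consolidated solution is an assumption.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def solver(question: str, seed: int) -> str:
    """Hypothetical solver: returns a draft answer (one seed is wrong on purpose)."""
    return "41" if seed == 2 else "42"


def critic(question: str, draft: str) -> str:
    """Hypothetical critic: reviews one draft and returns a cleaned-up version."""
    return draft.strip()


def rewriter(question: str, drafts: list[str]) -> str:
    """Hypothetical rewriter: consolidates all drafts (majority vote here)."""
    return Counter(drafts).most_common(1)[0][0]


def selector(question: str, candidates: list[str]) -> str:
    """Hypothetical selector: picks the final answer (majority vote here)."""
    return Counter(candidates).most_common(1)[0][0]


def run_workflow(question: str, n_solvers: int = 5) -> str:
    # Scattered phase: solvers run in parallel, then a critic reviews each draft.
    with ThreadPoolExecutor(max_workers=n_solvers) as pool:
        drafts = list(pool.map(lambda s: solver(question, s), range(n_solvers)))
    reviewed = [critic(question, d) for d in drafts]

    # Stacked phase: consolidate the drafts, then select the final answer.
    consolidated = rewriter(question, reviewed)
    return selector(question, reviewed + [consolidated])


print(run_workflow("What is six times seven?"))  # prints 42
```

Running several solvers in parallel supplies breadth, since one faulty draft is outvoted, while the later review, rewrite, and selection stages supply depth by refining what the earlier stage produced.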
X-Masters also performed especially well in the benchmark's biology/medicine category, where it surpassed existing agent systems, showing its strength on complex problems.
Humanity's Last Exam was launched this year by the Center for AI Safety and Scale AI to assess the capabilities of AI systems. Its questions were contributed by over 1,000 scholars at more than 500 institutions and are pitched at a very high level of difficulty.