According to AIbase, a new physics benchmark called "CritPt" shows that even the most advanced artificial intelligence models, such as Gemini 3 Pro and GPT-5, still have a long way to go before becoming true autonomous scientists. The benchmark is designed to rigorously evaluate leading AI models at the level of early PhD research.

CritPt: Testing AI's Practical Research Capabilities

"CritPt" was jointly developed by more than 50 physicists from over 30 institutions around the world. Its core goal goes beyond testing memory of textbook knowledge, but rather tests whether AI has the ability to solve original, unpublished research problems - equivalent to the independent work level of a capable physics graduate student.

To ensure rigor and prevent data contamination, all 71 complete research challenges in CritPt are based on unpublished material, covering 11 cutting-edge fields including quantum physics, astrophysics, high-energy physics, and biophysics. The research team further divided these challenges into 190 smaller "checkpoints" to measure a model's partial progress on complex problems.
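To illustrate the challenge/checkpoint structure, here is a minimal Python sketch; the class names, field names, and the example challenge are hypothetical, not taken from the actual CritPt dataset:

```python
# Hypothetical sketch of how a CritPt-style challenge decomposes into
# checkpoints (the 71-challenge / 190-checkpoint structure described above).
# All names and values below are illustrative, not the real dataset.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Checkpoint:
    description: str      # a well-defined sub-task of the full problem
    solved: bool = False  # graded independently of the full challenge

@dataclass
class Challenge:
    field_name: str       # e.g. "quantum physics", "biophysics"
    title: str
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def checkpoint_progress(self) -> float:
        """Fraction of sub-tasks solved, even if the full challenge is not."""
        if not self.checkpoints:
            return 0.0
        return sum(cp.solved for cp in self.checkpoints) / len(self.checkpoints)

# Illustrative case: partial credit on checkpoints, full challenge unsolved.
challenge = Challenge(
    field_name="quantum physics",
    title="decoherence in a driven qubit array",
    checkpoints=[
        Checkpoint("derive the effective Hamiltonian", solved=True),
        Checkpoint("estimate the decoherence time", solved=True),
        Checkpoint("predict the full long-time dynamics", solved=False),
    ],
)
print(f"checkpoint progress: {challenge.checkpoint_progress():.0%}")  # 67%
```

This decomposition is what lets the benchmark report partial progress on a problem that the model cannot solve end to end.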


Disheartening Initial Results: Top Models Have Less Than 10% Accuracy

The preliminary results were sobering. According to an independent assessment by the AI analytics firm Artificial Analysis, even the most powerful systems failed at most tasks:

  • Google's "Gemini 3 Pro Preview" topped the leaderboard with an accuracy of only 9.1%. (Notably, it used 10% fewer tokens than the runner-up.)

  • OpenAI's second-place "GPT-5.1 (high)" scored an accuracy of only 4.9%.

The results starkly reveal that current large language models generally lack the rigor, creativity, and precision required for open-ended physics problems. Although the models made some headway on the simpler, well-defined "checkpoint" sub-tasks, they were largely helpless when facing complete research challenges.

Core Obstacle: Fragile Reasoning Ability

The research team introduced a more stringent metric, the "consistent resolution rate" (a challenge counts as solved only if the model succeeds in at least four of five attempts), to test the models' stability. Under this metric, performance dropped significantly across the board.
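To make the stricter criterion concrete, here is a minimal Python sketch with hypothetical attempt data; only the four-of-five threshold comes from the article, while the problem names and outcomes are invented. It shows how plain accuracy and the consistent resolution rate can diverge:

```python
# Minimal sketch of the "consistent resolution rate" idea described above.
# Assumptions: each problem is attempted 5 times; a problem counts as
# consistently solved only if >= 4 of the 5 attempts are correct.
# The data below is hypothetical, not from the CritPt leaderboard.

from typing import Dict, List

ATTEMPTS = 5
THRESHOLD = 4  # at least four correct attempts out of five

# True/False outcome of each of the 5 attempts per challenge
results: Dict[str, List[bool]] = {
    "quantum_error_correction":  [True, True, True, True, False],   # consistent
    "dark_matter_profile":       [True, False, False, True, False], # lucky hits
    "lattice_qcd_extrapolation": [False, False, False, False, False],
}

def mean_accuracy(res: Dict[str, List[bool]]) -> float:
    """Fraction of all attempts that were correct (the lenient view)."""
    attempts = [a for trials in res.values() for a in trials]
    return sum(attempts) / len(attempts)

def consistent_resolution_rate(res: Dict[str, List[bool]]) -> float:
    """Fraction of problems solved in at least THRESHOLD of ATTEMPTS tries."""
    solved = sum(1 for trials in res.values() if sum(trials) >= THRESHOLD)
    return solved / len(res)

print(f"mean accuracy:              {mean_accuracy(results):.1%}")               # 40.0%
print(f"consistent resolution rate: {consistent_resolution_rate(results):.1%}")  # 33.3%
```

The gap between the two numbers captures the robustness problem the researchers describe: occasional lucky successes inflate plain accuracy but do not survive the consistency requirement.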

This lack of robustness poses a serious problem for real research workflows. The models often produce plausible-looking results that conceal subtle, hard-to-detect errors, which can mislead researchers and force experts to spend considerable time on review and verification.

Future Outlook: From Scientist to Research Assistant

Based on the CritPt test results, the researchers believe that in the foreseeable future, a more realistic goal is not to replace human experts with "AI scientists," but to use AI as a "research assistant" to automate specific workflow steps.

This view aligns with current industry planning: OpenAI claims that GPT-5 is already saving researchers time, plans to launch a "research intern" system by September 2026, and aims to introduce a fully autonomous researcher system by March 2028. The CritPt results, however, indicate that AI still has a huge technical gap to bridge before reaching that ultimate goal.