The recently released "CritPt" benchmark, developed by more than 50 physicists worldwide, evaluates how well top AI models handle complex, unpublished physics research problems. The test simulates the level of independent research expected of early-stage PhD students. Despite high expectations for systems such as Google's Gemini 3 Pro and OpenAI's GPT-5, the results were disappointing.

Image source note: The image is AI-generated, and the image licensing service provider is Midjourney
In an independent evaluation, Gemini 3 Pro ranked first with an accuracy of 9.1%, while GPT-5 came in second at 4.9%. These results indicate that even the best models still cannot solve most tasks, especially the more complex research challenges. CritPt covers 71 research challenges spanning 11 fields, including quantum physics, astrophysics, high-energy physics, and biophysics. To prevent models from succeeding by guessing or by retrieving memorized material, all questions are based on unpublished research.
The testing team also applied a stricter evaluation criterion, a "consistently solved" rate that requires a model to answer correctly in at least four of five attempts. Under this measure, every model's performance dropped significantly, highlighting how fragile their reasoning on complex problems remains. This unreliability poses a challenge for research workflows: models often generate answers that look correct but contain subtle errors, which can mislead researchers and increase the burden of review.
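The four-of-five criterion described above can be sketched as a simple scoring function. This is an illustrative reconstruction, not the benchmark's actual code; the function name and the example problems are hypothetical.

```python
# Hypothetical sketch of a "consistently solved" criterion like the one
# described in the article: a model gets credit for a problem only if it
# answers correctly in at least 4 of 5 independent attempts.

def consistent_solve_rate(attempts_by_problem, min_correct=4, n_attempts=5):
    """Fraction of problems answered correctly at least min_correct times."""
    solved = 0
    for outcomes in attempts_by_problem.values():
        assert len(outcomes) == n_attempts, "each problem needs n_attempts runs"
        if sum(outcomes) >= min_correct:
            solved += 1
    return solved / len(attempts_by_problem)

# Illustrative data: 3 problems, each attempted 5 times (True = correct).
runs = {
    "problem_a": [True, True, True, True, False],   # 4/5 -> counts as solved
    "problem_b": [True, False, True, False, True],  # 3/5 -> not solved
    "problem_c": [False] * 5,                       # 0/5 -> not solved
}
print(consistent_solve_rate(runs))  # only 1 of 3 problems, ~0.33
```

Requiring repeated success penalizes lucky one-off answers, which is why scores drop sharply under this metric compared with single-attempt accuracy.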
The research team noted that current large models still fall short of independently solving open-ended physics problems; a more realistic goal is to treat them as "research assistants" that help with specific workflows. In line with this, OpenAI plans to launch a research-intern-level system in September 2026 and a fully autonomous research system in March 2028, and the company says GPT-5 is already saving researchers time.
Key Points:
🌟 Top AI models perform poorly on complex physics tasks, with the best accuracy at just 9.1%.
🔍 The "CritPt" benchmark test covers multiple physics fields, and all questions are based on unpublished research content.
🤖 Future AI is more likely to act as a research assistant rather than completely replacing human experts, helping to automate specific processes.
