Recently, a new AI evaluation benchmark called FormulaOne has attracted widespread attention. Introduced by AAI, a research institution focused on superintelligence and advanced AI systems, the benchmark challenged a number of top AI models, including GPT-5, Grok 4, and o3 Pro. The results were striking: on its hardest tier, every one of these models scored zero.
The FormulaOne benchmark comprises 220 novel graph-structured dynamic programming problems, with difficulty ranging from moderate to research-level, drawing on areas such as topology, geometry, and combinatorics. Although the problem statements may look simple, solving them demands long chains of reasoning and logical deduction, approaching the level of a doctoral research challenge.
The problems build on an algorithmic meta-theorem due to Courcelle, which states that on graphs of bounded treewidth (roughly, "tree-like" graphs), any property definable in monadic second-order logic can be decided efficiently by a dynamic programming algorithm. This relies on a structure known as a tree decomposition, which organizes the vertices of a graph into overlapping sets ("bags") arranged in a tree, over which the dynamic program is then solved bag by bag.
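To give a concrete feel for this paradigm, here is a minimal sketch (not taken from FormulaOne itself) of the simplest case: a tree has treewidth 1, its bags correspond to edges, and a classic problem like maximum independent set can be solved bottom-up by tracking a small state per vertex. The graph, function name, and example below are illustrative assumptions, not part of the benchmark.

```python
# Minimal sketch: dynamic programming on a tree, the simplest instance of
# DP over a tree decomposition (treewidth 1). Computes the size of a
# maximum independent set. Names and the example graph are illustrative.

def max_independent_set(tree: dict[int, list[int]], root: int = 0) -> int:
    """tree maps each vertex to its list of children in a rooted tree."""

    def solve(v: int) -> tuple[int, int]:
        # State per vertex: (best size if v is excluded, best size if v is included)
        excluded, included = 0, 1  # including v contributes 1 to the set
        for child in tree.get(v, []):
            c_ex, c_in = solve(child)
            excluded += max(c_ex, c_in)  # v excluded: child may be either
            included += c_ex             # v included: child must be excluded
        return excluded, included

    return max(solve(root))

# Example: the path 0-1-2-3-4 rooted at 0; the answer is 3 ({0, 2, 4}).
path = {0: [1], 1: [2], 2: [3], 3: [4]}
print(max_independent_set(path))  # -> 3
```

FormulaOne's problems work over bags of larger treewidth and far richer per-bag state spaces, and designing those states correctly is precisely where the models break down.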
On the shallow tier, these cutting-edge models performed reasonably well, with success rates between 50% and 70%, indicating some grasp of this class of problems. On the deeper, more challenging tier, however, the picture changed sharply: models such as Grok 4 and Gemini Pro solved only about 1% of the questions, while GPT-5 Pro did slightly better, solving just four. On the hardest tier, every model had a success rate of zero, a collective failure.
This result has sparked wide discussion in the research community and raised doubts about the true capabilities of today's models. Some have even suggested that human PhD students take the same assessment for comparison. As AI technology develops at a rapid pace, we must ask: how far are these models from genuinely "doctoral-level" reasoning?
Leaderboard: https://huggingface.co/spaces/double-ai/FormulaOne-Leaderboard
Key points:
✅ Top AI models such as GPT-5 scored zero on the hardest tier of the new FormulaOne benchmark, a shocking result!
✅ FormulaOne contains 220 high-difficulty graph-structured dynamic programming problems that test the reasoning ability of AI models.
✅ Most models performed reasonably well on shallow problems but failed on the deeper, more challenging ones, revealing the limits of current AI reasoning.