Research from the research organization METR shows that SWE-bench Verified, a benchmark widely used to evaluate AI programming capabilities, may significantly overestimate how AI agents perform in real software development settings. The study found that about half of the AI-generated solutions marked as "passed" by the benchmark would be rejected in an actual review by project maintainers, pointing to a substantial gap between automated evaluation results and real-world code quality.

SWE-bench Verified has long been considered one of the key evaluation standards for AI-assisted software engineering. It tests whether models can solve real programming problems from open-source projects, automatically checking whether each code change passes the project's test suite. Several AI companies, including Anthropic and OpenAI, regularly cite the benchmark to demonstrate model progress.
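As a rough illustration of how such automated checks work (a sketch, not SWE-bench's actual harness; the function name `is_resolved` and the test IDs are invented for this example): a patch typically counts as resolving an issue only if the tests that demonstrated the bug now pass, and the tests that passed before still do.

```python
# Hypothetical sketch of a SWE-bench-style automated check. A model's
# patch "resolves" an issue only if the designated failing tests now
# pass and no previously passing test regresses. Names are illustrative.

def is_resolved(results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """results maps test id -> whether it passed after applying the patch."""
    fixed = all(results.get(t, False) for t in fail_to_pass)
    no_regression = all(results.get(t, False) for t in pass_to_pass)
    return fixed and no_regression

# Example: the patch fixes the target test and breaks nothing else.
results = {"test_bug_1234": True, "test_existing_a": True, "test_existing_b": True}
print(is_resolved(results, ["test_bug_1234"],
                  ["test_existing_a", "test_existing_b"]))  # True
```

This binary pass/fail verdict is exactly what the METR study contrasts with maintainers' richer judgments: a patch can clear this check while still being code a maintainer would reject.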


In this study, the METR team invited four experienced developers who maintain the open-source projects scikit-learn, Sphinx, and pytest to manually review 296 AI-generated solutions produced by five models: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 4 Opus, Claude 4.5 Sonnet, and GPT-5. The results showed that the maintainers' actual acceptance rate was on average about 24 percentage points lower than the automated SWE-bench scores, a statistically significant difference.
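The study's per-model counts are not reproduced here, but the headline gap can be illustrated with a back-of-envelope two-proportion z-test. The counts below (148 automated passes versus 77 maintainer acceptances out of 296) are invented to match the reported ~24-point gap, not METR's actual data, and the study's own statistical analysis may well differ.

```python
import math

# Hypothetical counts chosen to reproduce the reported ~24-point gap
# on n = 296 reviewed solutions; NOT METR's actual data.
n = 296
auto_pass = 148   # solutions the automated harness marked as passing
human_pass = 77   # solutions maintainers said they would accept

p1, p2 = auto_pass / n, human_pass / n
gap_pp = (p1 - p2) * 100  # gap in percentage points

# Unpaired two-proportion z-test as a rough significance check.
p_pool = (auto_pass + human_pass) / (2 * n)
se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p1 - p2) / se
print(f"gap = {gap_pp:.0f} pp, z = {z:.2f}")
```

With counts of this size, a 24-point gap yields a z-statistic far beyond conventional significance thresholds, which is consistent with the article's claim that the difference is statistically significant.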

The study also found that rejected AI code was turned down not primarily for style issues but for more substantial engineering flaws. Maintainers grouped the problems into three types: code quality that fell short of project standards, disruption of existing code structure, and outright functional errors. In a considerable number of cases, the automated tests passed even though the code did not actually fix the problem.

On model comparison, the study found that upgrading from Claude 3.5 Sonnet to Claude 3.7 Sonnet significantly raised the benchmark pass rate, but the number of functional errors flagged by maintainers also rose; from Claude 3.7 Sonnet to Claude 4 Opus, the problems shifted more toward code quality, while Claude 4.5 Sonnet showed improved code quality. GPT-5, by contrast, performed significantly worse overall than the Anthropic models in this assessment.


The research team also estimated a "task time horizon": according to the SWE-bench automated results, Claude 4.5 Sonnet can complete, at a 50% success rate, tasks that take a human about 50 minutes; according to the maintainers' scores, that figure drops to only about 8 minutes, meaning the benchmark may overestimate capability by up to about 7 times.
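A quick sanity check on the quoted figures: the ratio of the two rounded time horizons comes out near 6x, broadly in line with the article's "up to about 7 times" (the exact factor depends on unrounded values not given here).

```python
# Rounded figures quoted in the report: benchmark-implied vs
# maintainer-implied task lengths (minutes) completed at 50% success.
benchmark_horizon_min = 50
maintainer_horizon_min = 8
factor = benchmark_horizon_min / maintainer_horizon_min
print(f"overestimate factor ~ {factor:.2f}x")  # roughly 6x on these rounded inputs
```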

However, the researchers emphasized that the study does not imply a fundamental capability limit for AI programming agents. With better prompting strategies, more human feedback, or multiple iterations, the gap between automated evaluation and manual review could still narrow. The experimental setup also differs from real development workflows: the AI agents had only one submission attempt, whereas human developers can usually keep revising their code in response to feedback.

In summary, the study argues that relying solely on benchmark scores to assess the practical utility of AI programming agents can introduce systematic bias. As AI coding models evolve rapidly, building evaluation systems that better reflect real development environments has become an important research direction in AI software engineering.