A new study from the University of Chicago reveals significant differences in accuracy, reliability, and robustness among the AI text detection tools on the market. Some tools can almost perfectly distinguish human- from AI-written text, while others misclassify frequently and struggle especially with short texts. The study finds that Pangram performed best among all tested systems, combining high precision with cost-effectiveness.

Research Design: Six Text Types and Four Major Language Models

The research team built a dataset of 1,992 human-written texts covering six types: Amazon product reviews, blog posts, news reports, novel excerpts, restaurant reviews, and resumes. They then generated corresponding AI samples with four major language models: GPT-4.1, Claude Opus 4, Claude Sonnet 4, and Gemini 2.0 Flash.

Detection performance was measured through two core metrics:

  • False Positive Rate (FPR): the proportion of human-written texts misclassified as AI-generated;

  • False Negative Rate (FNR): the proportion of AI texts that were not detected.
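
For concreteness, here is a minimal sketch of how these two rates are computed from binary detector decisions; the labels and predictions below are illustrative placeholders, not data from the study.

```python
# Minimal sketch: computing FPR and FNR from binary decisions.
# Convention: 1 = AI-generated, 0 = human-written, for both labels and predictions.
# The example data is made up for illustration only.

def false_positive_rate(labels, preds):
    """Share of human-written texts (label 0) flagged as AI (prediction 1)."""
    human = [p for l, p in zip(labels, preds) if l == 0]
    return sum(human) / len(human)

def false_negative_rate(labels, preds):
    """Share of AI-generated texts (label 1) missed by the detector (prediction 0)."""
    ai = [p for l, p in zip(labels, preds) if l == 1]
    return sum(1 - p for p in ai) / len(ai)

labels = [0, 0, 1, 1, 1, 0]   # ground truth
preds  = [0, 1, 1, 0, 1, 0]   # detector decisions
print(false_positive_rate(labels, preds))  # 1/3 of human texts flagged as AI
print(false_negative_rate(labels, preds))  # 1/3 of AI texts missed
```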


Pangram Leads, Open-Source Detectors Perform Worst

The results show that Pangram achieves near-zero false positive and false negative rates on medium and long texts. Even on short texts, its error rates stay below 0.01, with only a slight false positive rate of 0.02 on restaurant reviews generated by Gemini 2.0 Flash.

In contrast, OriginalityAI and GPTZero form a second tier: they remain reliable on long texts (false positive rates below 0.01), but their accuracy drops noticeably on short samples and "humanized" texts.

Detectors based on the open-source RoBERTa model performed worst, misclassifying 30% to 69% of human texts as AI-generated, making them almost unusable in practice.

Detection Effectiveness Varies by Generating Model

The study further points out that detection effectiveness is closely related to the type of AI model used.

  • Pangram can accurately identify texts generated by all four models, with a false positive rate always below 0.02;

  • OriginalityAI is more sensitive to Gemini 2.0 Flash but less effective at detecting the Claude series;

  • GPTZero is less affected by the model, but its overall accuracy still lags behind Pangram's.

For long texts such as novel excerpts and resumes, all detectors achieve generally high recognition rates, while short reviews and brief messages are more challenging. Even so, Pangram's algorithm maintains its advantage in short-text identification.

Facing Evasion Tools: Pangram Demonstrates Robustness

To test robustness against evasion, the researchers used StealthGPT, a tool designed to make AI texts harder to detect. Pangram's detection performance remained almost unaffected, while the other detectors saw a significant drop in accuracy.

In short-text scenarios of fewer than 50 words, Pangram showed the highest reliability, OriginalityAI often refused to classify such texts at all, and GPTZero's error rate was significantly higher than Pangram's.


Cost and Policy Control: Pangram Is More Practical

The study also calculated detection costs: on average, Pangram spends only $0.0228 to correctly identify one AI text, about half of OriginalityAI's cost and one-third of GPTZero's.
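
As a rough illustration of how such a per-detection figure can be derived (the per-call price and counts below are hypothetical placeholders, not the study's numbers; only Pangram's $0.0228 result above comes from the paper), the cost is the total spend divided by the number of correctly flagged AI texts:

```python
# Illustrative sketch of a "cost per correctly identified AI text" metric.
# All numbers below are made-up placeholders for the arithmetic, not study data.

price_per_call = 0.02        # hypothetical per-document API price, in dollars
ai_texts_scanned = 1000      # hypothetical number of AI-written documents scanned
true_positives = 880         # hypothetical number correctly flagged as AI

cost_per_true_positive = price_per_call * ai_texts_scanned / true_positives
print(f"${cost_per_true_positive:.4f} per correctly identified AI text")
```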

Additionally, the team introduced the concept of a “Policy Cap”: an institution sets a maximum acceptable false positive rate (e.g., 0.5%), and the detector's decision threshold is automatically calibrated to stay within that limit.
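
A rough sketch of what this kind of calibration might look like follows; the score distributions, helper names, and numbers are assumptions for illustration, not the study's actual procedure or code.

```python
# Minimal sketch of the "policy cap" idea: calibrate a detector's decision
# threshold so the false positive rate on human-written texts stays under an
# institution-chosen limit, then measure how many AI texts are still caught.
# Scores and distributions below are hypothetical.

import numpy as np

def calibrate_threshold(human_scores, max_fpr=0.005):
    """Choose a threshold so that at most ~max_fpr of human texts score above it."""
    return np.quantile(np.asarray(human_scores), 1.0 - max_fpr)

def detection_rate(ai_scores, threshold):
    """Share of AI-written texts still flagged at the calibrated threshold (1 - FNR)."""
    return float((np.asarray(ai_scores) > threshold).mean())

# Hypothetical detector scores in [0, 1]: higher means more "AI-like".
rng = np.random.default_rng(0)
human_scores = rng.beta(2, 8, size=2000)   # human texts cluster near 0
ai_scores = rng.beta(8, 2, size=2000)      # AI texts cluster near 1

t = calibrate_threshold(human_scores, max_fpr=0.005)
print(f"threshold={t:.3f}, detection rate under 0.5% FPR cap={detection_rate(ai_scores, t):.3f}")
```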

Under this constraint, Pangram was the only detector that maintained high accuracy at the 0.5% false positive limit, while the other detectors' performance declined significantly.

Research Insights: The "Arms Race" Between Detectors and Models

The researchers note that this contest is still in its early stages: as new generation models and "stealth" tools continue to evolve, the AI detection field faces an ongoing technological arms race.

They suggest that institutions regularly run "stress test"-style audits of their detectors to ensure the systems keep pace with the development of generative AI.

Moreover, the study emphasizes how sensitive detection decisions are in real-world settings: AI can legitimately assist with writing, but substituting for human originality in areas such as education, job applications, or evaluation raises ethical and authenticity concerns.

Background and Industry Reflection

Previously, multiple studies have questioned the reliability of AI detectors. OpenAI briefly offered an official detection tool, but withdrew it due to low accuracy and has not released a new version since. The researchers speculate that OpenAI may not be eager for ChatGPT output to be easily detectable, since that could reduce usage among core user groups such as students.

This study from the University of Chicago is therefore seen as one of the most systematic and quantitative AI detection evaluations to date.