Recently, a study from the University of Chicago revealed significant differences among the commercial AI text detection tools on the market. The researchers built a dataset of 1,992 human-written texts covering six genres: Amazon product reviews, blog posts, news reports, novel excerpts, restaurant reviews, and resumes. They then used four leading language models, GPT-4.1, Claude Opus 4, Claude Sonnet 4, and Gemini 2.0 Flash, to generate corresponding AI-written samples.
To compare these tools, the research team tracked two main metrics. The false positive rate (FPR) measures how often human-written texts are incorrectly flagged as AI-generated, while the false negative rate (FNR) is the proportion of AI-generated texts that go undetected. In this head-to-head comparison, the commercial detector Pangram performed excellently: for medium and long texts its FPR and FNR were nearly zero, and for short texts its error rates generally stayed below 0.01. The only exception was Gemini 2.0 Flash output in restaurant reviews, where the FNR reached 0.02.
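The two metrics can be sketched in a few lines of Python. This is a minimal illustration; the function name and the label convention (1 = AI-generated, 0 = human-written) are assumptions of this sketch, not details from the study:

```python
def fpr_fnr(y_true, y_pred):
    """False positive rate and false negative rate for a detector.

    Label convention (an assumption of this sketch):
    1 = AI-generated, 0 = human-written.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n_human = y_true.count(0)
    n_ai = y_true.count(1)
    # FPR: share of human texts wrongly flagged as AI
    # FNR: share of AI texts that slip through undetected
    return fp / n_human, fn / n_ai

# Example: 4 human texts, 4 AI texts, one mistake of each kind
fpr, fnr = fpr_fnr([0, 0, 0, 0, 1, 1, 1, 1],
                   [0, 1, 0, 0, 1, 1, 0, 1])
# fpr == 0.25, fnr == 0.25
```

A detector with "nearly zero" values on both metrics is one that almost never accuses a human writer and almost never misses an AI text.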

Other detection tools such as OriginalityAI and GPTZero performed somewhat worse. They did well on longer texts, keeping their FPR below 0.01, but were less reliable on very short texts. They were also more vulnerable to "humanization" tools that disguise AI-generated text as human writing.
Pangram was also excellent at identifying AI-generated texts, with an FNR of no more than 0.02 for all four models. By contrast, OriginalityAI's accuracy varied more with the generating model, while GPTZero was more stable across models but still fell short of Pangram.
The researchers also tested the ability of each detection tool to counter the StealthGPT tool, which makes AI-generated texts harder to detect. Pangram performed relatively robustly in these situations, while other detection tools faced greater challenges.
In terms of cost efficiency, Pangram averages $0.0228 per correctly identified AI text, roughly half the cost of OriginalityAI and one-third that of GPTZero. The study also introduced the concept of a "policy ceiling," which lets users set the maximum false positive rate they are willing to accept and tune the detector accordingly.
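A rough sketch of the two ideas in this paragraph: per-detection cost, and calibrating a score threshold so the FPR on human texts stays under a user-chosen ceiling. The function names and the decision rule (flag a text when its detector score exceeds the threshold) are illustrative assumptions, not the study's actual method:

```python
def cost_per_detection(total_spend, true_positives):
    """Average cost per correctly identified AI text."""
    return total_spend / true_positives

def threshold_for_ceiling(human_scores, max_fpr):
    """Smallest score threshold such that flagging texts scoring
    ABOVE it keeps the FPR on human texts at or below max_fpr."""
    s = sorted(human_scores, reverse=True)
    allowed = int(max_fpr * len(s))   # humans we may tolerate misflagging
    if allowed >= len(s):             # ceiling permits flagging everyone
        return float("-inf")
    return s[allowed]                 # only the top `allowed` scores exceed this

# Example: with four human texts and a 25% ceiling,
# at most one human text may be flagged.
scores = [0.1, 0.2, 0.3, 0.9]
t = threshold_for_ceiling(scores, max_fpr=0.25)
flagged = sum(score > t for score in scores)
# t == 0.3, flagged == 1 (FPR = 0.25, within the ceiling)
```

The practical point of such a ceiling is that institutions (e.g., schools) can cap the rate of false accusations against human writers, accepting a higher miss rate in exchange.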

The research team cautioned that these results are only a snapshot of the current situation: an "arms race" between detection tools, new AI models, and evasion tools will continue to unfold. They recommended regular, transparent audits to keep pace with this rapidly changing field.
Project: https://pangram.ai/
Key Points:
🌟 Pangram excels in detection accuracy, with nearly zero false positives and false negatives.
📊 Other tools struggle with short texts, giving Pangram a clear advantage in identifying AI-generated texts.
💰 Pangram has the lowest identification cost, offering significant economic benefits and a practical choice for users.
