The submission went in on a Tuesday.
Formal prose. Technical subject. Revised three times. The kind of writing that takes longer than it looks.
The editor came back Thursday. The AI checker said 100%. Policy required disclosure or rejection.
I wrote every word of it...
You submit something to a publication. The editor runs it through an AI checker. The tool returns a number. The number is high – 87%, 94%, 100%. The editor asks you about it. You explain that you wrote it yourself, or mostly yourself, or with some structural assistance, or that you wrote it in 2019 before any of this existed. The number does not change. The policy applies.
This is how AI content detection currently works. The tool produces a score. The score is treated as evidence. The evidence is wrong with a frequency that would disqualify it from any serious research context. Nobody in the decision chain is required to verify this. The policy stands.
The Score Is a Number. What the AI Checker Measures Is Something Else.
AI checkers do not detect AI. That would require knowing what AI-generated text actually is, which turns out to be a surprisingly hard problem. What they detect instead is a set of statistical patterns common in model output – uniform sentence rhythm, formal register, technical vocabulary, low perplexity in word choice, consistent paragraph length. These patterns appear in AI-generated text because AI models learned them from human writing. They also appear in human writing because that is what formal prose looks like.
The result is a tool that flags not AI, but a certain kind of writing. Academic papers. Technical documentation. Edited journalism. Blog posts that have been revised more than once. Articles written by non-native speakers who default to formal constructions. And, occasionally, writing published in 1999 that predates the tools entirely and still scores at 94%.
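To see why formal prose scores high, it helps to look at what this kind of detector actually computes. The sketch below is a toy – the statistics, weights and scaling are invented for illustration, not taken from Pangram or any other vendor – but the inputs are the same family: sentence-rhythm uniformity and word-choice predictability. Neither has anything to do with who, or what, wrote the text.

```python
import math
import re
from collections import Counter

def ai_likeness_score(text: str) -> float:
    """Toy score built from the surface statistics detectors lean on.

    Illustration only: it rewards uniform sentence length and predictable
    word choice, which is exactly what careful formal prose also exhibits.
    """
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if len(sentences) < 2 or len(words) < 10:
        return 0.0

    # 1. Sentence rhythm: low variance in sentence length reads as "machine-like".
    lengths = [len(s.split()) for s in sentences]
    mean = sum(lengths) / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    uniformity = 1.0 / (1.0 + variance / max(mean, 1.0))  # 1.0 = perfectly even

    # 2. Word choice: crude unigram entropy as a stand-in for perplexity.
    counts = Counter(words)
    total = len(words)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    predictability = 1.0 / (1.0 + entropy / 8.0)  # lower entropy -> higher score

    # Invented weighting. Real tools are fancier, but the inputs rhyme.
    return round(100 * (0.6 * uniformity + 0.4 * predictability), 1)
```

On these two axes, carefully edited prose tends to look more machine-like than loose, uneven prose – which is the whole problem in miniature.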
The percentage is presented as a confidence score. It is not. It is a similarity score – how closely the text resembles the statistical distribution of the detector's training data. The confidence is in the tool's own methodology, not in the conclusion it is asking you to draw. That distinction is not communicated to the editors using it to make decisions.
This is worth understanding before getting into the specific tools, because the problem isn't implementation. It's the premise. A well-implemented version of the same approach would produce the same false positives. AI model collapse is already a documented problem in other contexts; here the problem is different – a tool being trusted to do a job it was never designed to do.
Pangram Said 100%. ZeroGPT Agreed. GPTZero Wasn’t Sure.
There are three detection tools that most publishers are currently using or citing when they flag content.
Pangram AI is the most aggressive. It returns high confidence scores on content that other tools assess differently, and it has a documented tendency to flag formal technical writing regardless of origin. A 100% score from Pangram communicates certainty. It is not certain. It is a high similarity reading from a tool with no published false positive rate, applied as though it were a forensic test. It has returned scores above 90% on passages from books published before GPT existed. Nobody involved found this worth addressing. Editors who use Pangram exclusively are making editorial decisions based on a number with no disclosed methodology and no published accuracy rate.
ZeroGPT runs at a similar confidence level with similar reliability problems. It flagged my friend's research grant application as 76% AI-generated. I know she wrote it herself, over three weeks. The grant committee was not interested in the nuance. The false positive rate on human-written technical content is high enough that using it as a primary filter is not a quality control measure – it's a coin flip with extra steps. Using it as a standalone assessment is a choice, not a necessity.
GPTZero is more or less fine – which in this context means it is somewhat less wrong than the others. It produces lower confidence scores on ambiguous content, flags uncertainty more explicitly, and has at least some published research behind its approach. It is not reliable. It is just the least unreliable option currently available, which is a low bar.
If you produce UX/UI design deliverables in your practice – audits, case studies, proposals – the gap between what a tool claims to measure and what it actually measures should be familiar. The AI research workflow has the same problem in the other direction: a tool presented as definitive that is doing something much noisier than the interface suggests.
The Policy Doesn’t Check the AI Checker
Publishers adopted AI detection policies because AI-generated content became a real problem worth addressing. That part is fair. The volume of low-effort synthetic content flowing into editorial pipelines in 2023 and 2024 was significant, and doing nothing was not a reasonable option.
What they did instead was adopt the available tools without validating them. The policy says: run submissions through these tools, flag high scores, require disclosure or rejection. The policy does not say: verify the tool’s false positive rate on the type of content we publish, establish a threshold that accounts for the tool’s known limitations, or treat a high score as a signal to investigate rather than a verdict.
The result is editorial infrastructure that functions as though these tools are accurate, while the tools themselves have no obligation to be. A high score from Pangram is not evidence of AI generation. It is evidence that the text scored high on Pangram. These are not the same thing, and the gap between them is where writers who draft in formal registers, work in technical fields, or revise their prose carefully get caught.
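Reduced to code, the current policy is short, and what it leaves out is conspicuous. The pipeline, detector interface and threshold below are hypothetical – no publisher publishes theirs – but the shape of the logic is the part that matters.

```python
# A hypothetical editorial pipeline, reduced to the logic the policy describes.
# The detector interface, threshold and return values are invented for illustration.

FLAG_THRESHOLD = 80  # chosen by someone, validated against nothing

def handle_submission(text: str, detector) -> str:
    score = detector.score(text)  # one number, no error bars, no methodology
    if score < FLAG_THRESHOLD:
        return "accept"
    return "require disclosure or reject"

    # What never appears in this flow:
    # - the detector's false positive rate on the kind of content we publish
    # - the score as a reason to investigate rather than a verdict
    # - any path where the writer's explanation changes the outcome
```

The comments after the return are the point: nothing in the flow ever reaches them.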
The problem is not that publications want to address AI-generated content. That instinct is correct.
The problem is that the tools being used to operationalise that instinct are producing false positives that no one in the policy chain is accountable for.
The tool flags it.
The policy applies.
The writer explains.
The score doesn’t change.
The next submission from a different writer gets flagged the same way.
This is not a learning system.
AI in design has always required a layer of human judgment to be useful rather than harmful. The same is true here, and it is currently missing. What’s in place instead is automated suspicion dressed up as verification.
Nothing About This Is Going to Change Soon
The economics are straightforward. Building a more accurate detection tool requires more investment, more transparency about methodology, and a published false positive rate that would make the tool look worse than competitors who don’t publish theirs. None of those things are incentivised.
Publishers adopting better evaluation practices would require acknowledging that their current process is producing incorrect flags, which would require dealing with the writers those flags affected. That conversation is also not incentivised.
The tools will get marginally better. Mostly this means the interface will improve – a confidence breakdown by paragraph, a colour-coded heatmap, a shareable report. The score will still be wrong at roughly the same rate. The policies will persist. Formal prose will continue to score high. Writers who work carefully will continue to get flagged. The percentage will continue to be treated as evidence, and somewhere a product team will ship a v2 with a pie chart.
For UX design practitioners who produce written work – audits, research reports, client documentation – this is the new baseline. Carefully written, technically precise, revised until it's right. Exactly the profile these tools are trained to flag. The UX research vs design gap has nothing on the gap between what an AI checker claims to measure and what it actually measures. At least in research, everyone knows the findings get ignored. Here, everyone assumes the findings are correct.
