AI detection tools cannot prove that text is AI-generated
The runaway success of generative AI has spawned a billion-dollar sub-industry of “AI detection tools”: tools that purport to tell you whether a piece of text was written by a human being or generated by a model like ChatGPT. How could that possibly work?
I think these tools are both impressive and useful, and will likely get better. However, I am very worried about the general public overestimating how reliable they are. AI detection tools cannot prove that text is AI-generated.
Why AI detection is hard
My initial reaction when I heard about these tools was “there’s no way that could ever work”. I think that initial reaction is broadly correct, because the core idea of AI detection tools - that there is an intrinsic difference between human-generated writing and AI-generated writing - is just fundamentally mistaken[0].
Large language models learn from huge training sets of human-written text. They learn to generate text that is as close as possible to the text in their training data. It’s this data that determines the basic “voice” of an AI model, not anything about the fact that it’s an AI model. A model trained on Shakespeare will sound like Shakespeare, and so on. You could train a thousand different models on a thousand different training sets without finding a common “model voice” or signature that all of them share.
We can thus say (almost a priori) that AI detection tools cannot prove that text is AI-generated. Anything generated by a language model is by definition the kind of thing that could have been generated by a human.
Why AI detection tools might work anyway
But of course it’s possible to tell when something was written by AI! When I read Twitter replies, the obviously-LLM-generated ones stick out like a sore thumb. I wrote about this in “Why does AI slop feel so bad to read?”. How can this be possible, when it’s impossible to prove that something was written by AI?
Part of the answer here might just be that current-generation AI models have a really annoying “house style”, and any humans writing in the same style are annoying in the same way. When I read the first sentence of a blog post and think “oh, this is AI slop, no need to keep reading”, I don’t actually care whether it’s AI or not. If it’s a human, they’re still writing in the style of AI slop, and I still don’t want to read the rest of the post.
However, I think there’s more going on here. Claude does kind of sound like ChatGPT a lot of the time, even though they’re different models trained in different ways on (at least partially) different data. I think the optimistic case for AI detection tooling goes something like this:
- RLHF and instruction/safety tuning pushes all strong LLMs towards the same kind of tone and style
- That tone and style can be automatically detected by training a classifier model
- Sure, it’s possible for technically-sophisticated users to use abliterated LLMs or less-safety-tuned open models, but 99% of users will just be using ChatGPT or Claude (particularly if they’re lazy enough to cheat on their essays in the first place)
- Thus a fairly simple “ChatGPT/Claude/Gemini prose style detector” can get you 90% of the way towards detecting most people using LLMs to write their essays
I find this fairly compelling, so long as you’re okay with a 90% success rate. A 90% success rate can be surprisingly bad if the base rate is low, as illustrated by the classic Bayes’ theorem example. If 10% of essays in a class are AI-written, and your detector is 90% accurate, then only half of the essays it flags will be truly AI-written. If an AI detection tool thinks a piece of writing is AI, you should treat that as “kind of suspicious” instead of conclusive proof.
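To make that arithmetic concrete, here’s a quick sketch of the base-rate calculation, using the same illustrative numbers as above (10% of essays AI-written, a detector that’s 90% accurate in both directions):

```python
# Base-rate arithmetic for a "90% accurate" AI detector.
# Illustrative numbers only: 10% of essays are AI-written, the detector
# flags 90% of AI-written essays and wrongly flags 10% of human ones.
prevalence = 0.10           # P(essay is AI-written)
true_positive_rate = 0.90   # P(flagged | AI-written)
false_positive_rate = 0.10  # P(flagged | human-written)

flagged_ai = prevalence * true_positive_rate             # 0.09
flagged_human = (1 - prevalence) * false_positive_rate   # 0.09

# Bayes' theorem: P(AI-written | flagged)
precision = flagged_ai / (flagged_ai + flagged_human)
print(f"Share of flagged essays that are actually AI-written: {precision:.0%}")
# -> 50%
```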
How do AI detection tools work?
There are a few different approaches to building AI detection tools. The naive approach - which I couldn’t find any actual production examples of - would be to train a simple text classifier on a body of human-written and AI-written text. Apparently this doesn’t work particularly well. The Ghostbuster paper tried this and decided that it was easier to train a classifier on the logits themselves: they pass each candidate document through a bunch of simple LLMs, record how much each LLM “agreed” with the text, then train their classifier on that data. DNA-GPT takes an even simpler approach: they truncate a candidate document, regenerate the last half via frontier LLMs, and then compare that with the actual last half.
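As a rough illustration of that second idea - scoring text with smaller models and training a classifier on those scores - here’s a minimal sketch. It is not the actual Ghostbuster pipeline (which searches over combinations of probability-based features from several models); GPT-2 here is just a stand-in scoring model, and the two-document corpus is a placeholder for a real labelled dataset:

```python
# Sketch: use a small LM's per-token log-probabilities as features for an
# ordinary classifier. Assumes the transformers and scikit-learn packages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_log_likelihood(text: str) -> float:
    """Average per-token log-probability: how much the model 'agrees' with the text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return -out.loss.item()  # the loss is the mean negative log-likelihood

# Toy placeholder corpus: 0 = human-written, 1 = AI-generated.
documents = [
    "i dunno, the ending felt rushed but the middle chapters were great",
    "In conclusion, this novel masterfully delves into the human condition.",
]
labels = [0, 1]

features = [[mean_log_likelihood(doc)] for doc in documents]
classifier = LogisticRegression().fit(features, labels)
print(classifier.predict_proba(features))  # per-class probabilities for each document
```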
The most impressive paper I’ve seen is the EditLens paper by Pangram Labs. EditLens trains a model on text that was edited by AI to various extents, not generated from scratch, so the model can learn to predict the granular degree of AI involvement in a particular text. This plausibly gets you a much better classifier than a strict “AI or not” classifier model, because each example teaches the model a numeric value instead of a single bit of information.
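Here’s a toy illustration of that difference in training signal - not the actual EditLens setup. It assumes some hypothetical text encoder (a random embedding stands in for it here) and compares the one-bit “AI or not” target with a graded edit-fraction target:

```python
# Toy comparison of training targets; the encoder is faked with random numbers.
import torch
import torch.nn as nn

embedding = torch.randn(1, 768)      # stand-in for encoder(document)
head = nn.Linear(768, 1)
logit = head(embedding).squeeze(-1)

# Binary "AI or not" target: each example carries a single bit of information.
binary_label = torch.tensor([1.0])   # AI was involved, somehow
binary_loss = nn.functional.binary_cross_entropy_with_logits(logit, binary_label)

# Graded "degree of AI editing" target: each example carries a number.
edit_fraction = torch.tensor([0.35]) # e.g. a lightly AI-edited draft
graded_loss = nn.functional.mse_loss(torch.sigmoid(logit), edit_fraction)
```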
One obvious point: all of these tools use AI themselves. There is simply no way to detect the presence of AI writing without either training your own model or running inference via existing frontier models. This is bad news for the most militantly anti-AI people, who would prefer not to use AI for any reason, even to catch other people using AI. It also means that - as I said earlier and will say again - AI detection tools cannot prove that text is AI-generated. Even the best detection tools can only say that it’s extremely likely.
Humanizing tools
Interestingly, there’s a sub-sub-industry of “humanizing” tools that aim to convert your AI-generated text into text that will be judged by AI detection tools as “human”. Some free AI detection tools are actually sales funnels for these humanizing tools, and will thus deliberately produce a lot of false positives so users will pay for the humanizing service. For instance, I ran one of my blog posts[1] through JustDone, which assessed it as 90% AI-generated and offered to fix it up for the low, low price of $40 per month.
These tools don’t say this outright, but of course the “humanizing” process involves passing your writing through an LLM that’s either prompted or fine-tuned to produce less-LLM-sounding content. I find this pretty ironic. There are probably a bunch of students who have been convinced by one of these tools to turn their human-written essay into LLM-generated text, out of (justified) paranoia that a false positive would get them in real trouble with their school or university.
False positives and social harm
It is to almost everyone’s advantage to pretend that these tools are better than they are. The companies that make up the billion-dollar AI detection tool industry obviously want to pretend they’re selling a perfectly reliable product. University and school administrators want to pretend they’ve got the problem under control. People on the internet like dunking on others by posting a screenshot that “proves” they’re copying their messages from ChatGPT.
Even the AI labs themselves would like to pretend that AI detection is easy and reliable, since it would relieve them of some of the responsibility they bear for effectively wrecking the education system. OpenAI actually released their own AI detection tool in January 2023, before retiring it six months later due to “its low rate of accuracy”.
The people who really suffer from this mirage are those doing their own writing, who now have to deal with being mistakenly accused of passing AI writing off as their own. I know students who are second-guessing how they write in order to sound “less like AI”, or who are recording their keystrokes or taking photos of drafts so they have some kind of evidence to push back against false positives.
If you are in a position where you’re required to judge whether people are using AI to write their articles or essays, I would urge you to be realistic about the capabilities of AI detection tools. They can make educated guesses about whether text was written by AI, but educated guesses are all they can offer. That goes double if you’re using a detection tool that also offers a “humanizing” service, since those tools are incentivized to produce false positives.
AI detection tools cannot prove that text is AI-generated.
[0] People sometimes talk about watermarking: when a provider like OpenAI deliberately trains their model to output text in some cryptographic way that leaves a very-hard-to-fake fingerprint. For instance, maybe it could always output text where the frequency of “e”s divided by the frequency of “l”s approximates pi. That would be very hard for humans to copy! I suspect there’s some kind of watermarking going on already (OpenAI models output weird space characters, which might trip up people naively copy-pasting their content) but I’m not going to talk about it in this post, because (a) sophisticated watermarking harms model capability so I don’t think anyone’s doing it, and (b) unsophisticated watermarking is easily avoided.

[1] I write every one of these posts with my own human fingers.