In plain English
An LLM hallucinates when it states something false with total confidence — a fake court case, a made-up API method, a wrong birthday, a citation that doesn't exist. The answer sounds right. It's written in the same calm, fluent voice the model uses for true things. That's exactly what makes it dangerous.
Here's the everyday analogy. Imagine a student in an oral exam who never studied a particular topic but is graded only on whether they sound knowledgeable. Saying "I don't know" gets them nothing. Confidently guessing might get them partial credit. So they bluff — smoothly, plausibly, and sometimes wrongly. An LLM is that student, scaled up to a trillion exam questions.
The crucial thing for beginners: a model isn't looking up facts in a database. It predicts the next token — the most statistically plausible continuation of your text — over and over. "Plausible" and "true" usually line up. When they don't, you get a hallucination. The model never intended to lie, because it has no concept of true versus false in the first place. It only has likely versus unlikely.
Why it matters
If LLMs only ever wrote poetry, hallucination wouldn't matter. But people now use them for medical questions, legal research, code, and customer support — places where a confident wrong answer causes real harm. In 2023 two lawyers were sanctioned for filing a brief full of fake citations a chatbot invented. The cases looked perfectly real. They just didn't exist.
The hard part is that hallucinations are invisible from the inside of the answer. There's no spelling mistake, no broken grammar, no flashing red light. A fabricated function name sits right next to three real ones. This is why "just read the output carefully" is not a reliable defense: nobody designed the model to deceive, but training optimizes for text that looks correct, so its errors look correct too.
- Developers ship code that calls methods which don't exist, or trust a summary that quietly misstates the source.
- Researchers and students get citations and statistics that are plausible but fabricated.
- Businesses put a chatbot in front of customers and discover it confidently invents refund policies.
- Everyone slowly learns to distrust all AI output — even the (frequently correct) parts — which wastes the technology's real value.
The good news, verified across 2026 benchmarks: rates have fallen sharply on factual-recall tests as models and tooling improved. The bad news: independent studies still find hallucinations in a large share of real-world interactions, and rates climb fast in specialized domains like law and medicine. Hallucination is reducible, not solved — and probably never fully solvable, for reasons we'll get to.
How it works
To see why hallucination is structural rather than a passing bug, follow what the model actually does at generation time. It does not retrieve. It predicts.
At each step the model produces a probability distribution over the whole vocabulary, and a sampling step picks one token. Nowhere in that loop is there an "is this factually true?" gate. If the training data made some plausible-but-wrong name highly likely, that name can win. For a deeper look at the loop itself, see how LLMs actually work.
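To make the loop concrete, here is a minimal sketch of a single sampling step. The token names and scores are invented for illustration (assume "Okafor" is the true answer and "Smith" is merely the more common name); a real model scores tens of thousands of tokens, but the shape of the step is the same.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

def sample_next_token(logits, rng):
    """Pick one token in proportion to its probability.

    Nothing here checks whether the chosen token is true.
    """
    probs = softmax(logits)
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the token after "The award was won by Dr. ..."
# Assumption: "Okafor" is correct, "Smith" is plausible but wrong.
toy_logits = {"Smith": 3.0, "Okafor": 1.5, "Lee": 0.5}

rng = random.Random(0)
draws = [sample_next_token(toy_logits, rng) for _ in range(1000)]
print(softmax(toy_logits))
print({tok: draws.count(tok) for tok in toy_logits})
```

In this toy setup the wrong name wins most draws simply because it scored higher, which is the whole mechanism in miniature.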
Cause 1: pretraining can't avoid it
The 2025 paper Why Language Models Hallucinate (from OpenAI researchers, on arXiv) frames this cleanly. Even with perfectly clean training data, generating valid text is at least as hard as a simpler task: deciding whether a given statement is valid (they call it "Is-It-Valid"). For facts that appear rarely in the data (a specific person's birthday, an obscure award) the model has no reliable signal, so its error rate on those facts is bounded below by roughly the share of them that appear only once in the training data, the so-called singletons. In plain terms: for rare facts, the math guarantees some minimum rate of confident wrongness.
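As a rough illustration of the singleton idea, the sketch below counts how many distinct facts show up exactly once in a toy corpus. The fact labels and counts are invented, and this simplified ratio over distinct facts is an assumption made for illustration, not the paper's formal definition.

```python
from collections import Counter

def singleton_rate(fact_mentions):
    """Fraction of distinct facts that appear exactly once in the corpus."""
    counts = Counter(fact_mentions)
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)

# Hypothetical corpus: two well-covered facts, two seen only once.
mentions = (
    ["einstein_birthday"] * 5
    + ["curie_birthday"] * 3
    + ["obscure_person_a_birthday", "obscure_person_b_birthday"]
)

rate = singleton_rate(mentions)
print(f"singleton rate: {rate:.2f}")
```

Read loosely, a high rate means a large share of facts in that category gave the model only one look, which is where confident errors are expected to concentrate.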
Cause 2: post-training rewards bluffing
Pretraining sets a floor; post-training and evaluation keep it from improving. Most benchmarks are graded like a multiple-choice exam: right answer = 1 point, wrong answer = 0, "I don't know" = 0. Under that scoring, a model that guesses on something it's unsure about will, on average, outscore a model that abstains — because a guess sometimes lands. Models are optimized against exactly these benchmarks, so they learn the test-taker's strategy: when uncertain, bluff confidently.
Model that guesses when unsure:

- Right when it knows → points
- Guesses when unsure → sometimes lucky → points
- Higher benchmark score
- More confident hallucinations

Model that abstains when unsure:

- Right when it knows → points
- Abstains when unsure → zero points
- Lower benchmark score
- Fewer hallucinations, but looks 'worse'
This is the key insight of the modern view: hallucination isn't only a data problem, it's an incentive problem. We built the scoreboard to reward confident guessing, and the models learned to play it. RLHF can make it worse too, when human raters prefer long, assertive answers over a humble "I'm not sure."
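The incentive argument is just arithmetic, and a few lines make it explicit. This sketch uses the scoring described above (right = 1, wrong = 0, "I don't know" = 0); the function name and the sample confidence values are illustrative assumptions.

```python
def expected_score(p_correct, abstain, right=1.0, wrong=0.0, idk=0.0):
    """Average points per question under a simple grading scheme.

    p_correct is the chance that a guess turns out to be right.
    """
    if abstain:
        return idk
    return p_correct * right + (1 - p_correct) * wrong

for p in (0.1, 0.3, 0.5):
    guess = expected_score(p, abstain=False)
    hold = expected_score(p, abstain=True)
    print(f"p={p:.1f}  guess={guess:.2f}  abstain={hold:.2f}")
```

Even a 10% chance of being right beats abstaining under this scheme, so guessing is never the worse strategy on the scoreboard.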
Fixes that actually move the needle
No single trick eliminates hallucination, but several stack up to large reductions. Here's the practical toolkit, roughly from cheapest to most involved.
| Fix | What it does | Effort |
|---|---|---|
| Give the model the source | Paste the doc/data into the prompt so it reads instead of recalls | Low |
| Permit abstention | Explicitly tell it to say 'I don't know' when unsure | Low |
| Lower temperature | Less random sampling for factual tasks | Low |
| RAG | Retrieve trusted docs at query time, ground the answer in them | Medium |
| Tool use / function calling | Let it call a calculator, search, or database for ground truth | Medium |
| Verification pass | A second LLM or human checks claims against sources | High |
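
To see what the "Lower temperature" row does mechanically, here is a small sketch of temperature-scaled softmax. The three scores are made up; the point is only how the distribution sharpens as temperature drops.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide scores by the temperature before normalizing.

    Lower temperature concentrates probability on the top-scoring token.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Note the limit of this fix: it makes the model pick its top choice more reliably, which helps only when the top choice is the right one.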
The single biggest lever: grounding (RAG)
Retrieval-augmented generation flips the problem. Instead of asking the model to recall a fact from its frozen weights, you retrieve the relevant document at query time and put it in the context window, then ask the model to answer using only that text — ideally with citations you can click. Reading beats remembering. Industry reports in 2026 put RAG's hallucination reduction in the tens of percent on enterprise tasks.
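The retrieve-then-read flow can be sketched without any model at all. Everything below is a toy under stated assumptions: the documents are invented, the retriever ranks by simple word overlap (real systems typically use embeddings or a search index), and the final prompt would be sent to whatever LLM you use.

```python
import re

def tokenize(text):
    """Lowercase word set, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & tokenize(doc["text"])),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_prompt(question, passages):
    """Put the retrieved text in the prompt and restrict the answer to it."""
    sources = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the sources below and cite the source id.\n"
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

# Hypothetical knowledge base for a support bot.
docs = [
    {"id": "refunds", "text": "Refund requests are accepted within 30 days of purchase."},
    {"id": "shipping", "text": "Standard shipping takes 5 business days."},
]

question = "What is the refund window after a purchase?"
top = retrieve(question, docs)
prompt = build_grounded_prompt(question, top)
print(prompt)
```

The design choice that matters is the instruction to use only the supplied sources: the model is asked to read, and the source id gives a reviewer something to check.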
Prompt-level moves anyone can use
- Explicitly allow 'I don't know.' A single line like "If you are not certain, say you don't know rather than guessing" measurably reduces fabrication. You're locally undoing the benchmark incentive.
- Ask for citations or quotes from a provided source. "Quote the exact sentence that supports this" makes ungrounded claims hard to fake.
- Lower the temperature for factual tasks so sampling favors the highest-probability (usually safest) tokens.
- Don't ask for what it can't know. Questions past the knowledge cutoff, or about niche private facts, are hallucination magnets — give it the data instead.
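
Two of the moves above, permitting abstention and asking for an exact quote, pair naturally with a cheap automated check: if the model returns a supporting quote, confirm it really appears in the source you provided. The sketch below assumes you have already parsed the quote out of the model's reply; the helper names and sample text are hypothetical.

```python
ABSTAIN_LINE = "If you are not certain, say you don't know rather than guessing."

def build_prompt(question, source):
    """Prompt that permits abstention and asks for a verbatim supporting quote."""
    return (
        f"{ABSTAIN_LINE}\n"
        "Quote the exact sentence that supports your answer.\n\n"
        f"Source:\n{source}\n\nQuestion: {question}"
    )

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(text.lower().split())

def quote_is_grounded(quote, source):
    """True only if the quoted text appears verbatim in the source."""
    cleaned = normalize(quote)
    return bool(cleaned) and cleaned in normalize(source)

source_text = "Refund requests are accepted within 30 days of purchase."

print(quote_is_grounded("accepted within 30 days", source_text))
print(quote_is_grounded("accepted within 90 days", source_text))
```

A failed check does not prove the answer is wrong, but it is a strong signal that the claim was not read from the source.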
The mid-2026 landscape
Frontier labs now treat hallucination as a first-class metric, not an afterthought. As of mid-2026 the leading general models are Anthropic's Claude Opus 4.5 (released late 2025) and the Sonnet line, OpenAI's GPT-5.x family, and Google's Gemini 3 family with up-to-1M-token context. Public 2026 benchmark write-ups consistently rank the Claude models among the lowest hallucination rates on factual queries, though exact numbers vary wildly by benchmark and shouldn't be quoted as gospel.
Two shifts define the current moment:
The biggest conceptual change, echoed across 2026 research, is that calibration — does the model know what it doesn't know? — is now seen as the real frontier, not accuracy alone. A model that's right 95% of the time but gives no signal on the wrong 5% is more dangerous than one that's right 85% of the time and flags its own uncertainty. New training and evaluation methods (penalizing over- and under-confidence, scoring abstention as a valid answer) aim directly at the incentive problem the OpenAI paper named.
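One possible scoring rule of this kind is sketched below. It is an illustrative assumption, not a description of any specific benchmark: a correct answer earns 1 point, "I don't know" earns 0, and a wrong answer costs t / (1 - t) points for a chosen confidence threshold t. Under that rule, answering only pays off when the model's confidence exceeds t.

```python
def penalized_expected_score(p_correct, threshold):
    """Expected points for answering when wrong answers are penalized.

    Correct = +1, wrong = -threshold / (1 - threshold), abstain = 0.
    """
    if not 0 <= threshold < 1:
        raise ValueError("threshold must be in [0, 1)")
    penalty = threshold / (1 - threshold)
    return p_correct - (1 - p_correct) * penalty

def should_answer(p_correct, threshold):
    """Answer only if doing so beats the 0 points earned by abstaining."""
    return penalized_expected_score(p_correct, threshold) > 0

t = 0.75  # hypothetical threshold: wrong answers cost 3 points
for p in (0.5, 0.75, 0.9):
    score = penalized_expected_score(p, t)
    print(f"confidence={p:.2f}  expected={score:+.2f}  answer={should_answer(p, t)}")
```

With the threshold set to 0 the penalty vanishes and the rule collapses back to ordinary right-or-wrong grading, where guessing always wins.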
Products reflect this too: mainstream assistants now ground answers in live web search and show citations, agents call tools for anything computable, and serious deployments wrap models in guardrails and evals. The model is increasingly one component in a system designed to check it.
Going deeper
Once the basics click, a few subtler points separate people who understand hallucination from people who just fear it.
Hallucination vs. wrong-for-other-reasons
Not every wrong answer is a hallucination. If you ask about an event after the model's training cutoff, it's ignorant, not hallucinating — though it may then hallucinate to fill the gap. If it miscounts the letters in "strawberry," that's a tokenization quirk: the model never sees individual letters, only tokens, so it's a representation problem, not a fabrication. Diagnosing which failure you're hitting tells you which fix applies.
Do models know when they're wrong?
Partly. The probability the model assigns to its own tokens carries real signal — low-probability spans correlate with errors, and techniques like sampling several answers and checking if they agree (self-consistency) catch a meaningful fraction of confabulations. But a model can be fluently confident and wrong because its internal probability reflects textual plausibility, not truth. The two come apart exactly on rare facts. That's why external grounding beats asking the model to introspect.
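A minimal sketch of the self-consistency check: sample the same question several times, then accept the majority answer only if enough samples agree. The sample answers below are hard-coded stand-ins for real model outputs, and the 0.6 agreement cutoff is an arbitrary assumption.

```python
from collections import Counter

def self_consistency(answers, min_agreement=0.6):
    """Return (majority answer or None, agreement ratio).

    None means the samples disagreed too much to trust any of them.
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        return None, 0.0
    top_answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    if agreement >= min_agreement:
        return top_answer, agreement
    return None, agreement

# Hypothetical samples for a well-known fact and for an obscure one.
stable = ["1879", "1879", "1879", "1880", "1879"]
scattered = ["1912", "1921", "1907", "1912", "1930"]

print(self_consistency(stable))
print(self_consistency(scattered))
```

Agreement is evidence, not proof: a model can be consistently wrong, which is the gap external grounding is meant to cover.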
Why 'just train a bigger model' won't fully fix it
Scaling reduces hallucination on facts that appear more often as data grows, but the floor for genuinely rare, one-off facts doesn't vanish — there's simply no signal to learn from. And as long as benchmarks reward guessing, bigger models also get better at confidently bluffing. The durable fix is socio-technical: change what we reward, build systems that retrieve and verify, and design evals that credit a well-placed "I don't know."
FAQ
Why does ChatGPT make up facts and sources?
Because it predicts the most plausible next token, not the most true one. It has no fact database to look things up in — it generates text that statistically resembles its training data. When a real fact is rare or absent, a plausible-but-fake version (a citation, a quote, a number) can be the highest-probability continuation, so the model writes it with full confidence.
Do LLMs know when they are wrong?
Only partially. The probability a model assigns to its own words carries some signal — low-confidence spans tend to be wrong more often — but that confidence reflects how plausible the text is, not whether it's true. On rare facts, a model can be completely confident and completely wrong. That's why external checks (RAG, tools, human review) work better than asking the model to grade itself.
How do I reduce LLM hallucinations in my app?
Stack several fixes: ground answers with retrieval (RAG) so the model reads a source instead of recalling, explicitly let it say 'I don't know,' lower the temperature for factual tasks, give it tools (search, calculator, database) for anything checkable, and add a verification pass for high-stakes output. Each one helps; together they cut hallucination substantially — though never to zero.
Is AI hallucination a bug that will be fixed?
It's better understood as a structural property than a bug. 2025–2026 research shows hallucination is partly guaranteed by how next-token prediction works on rare facts, and partly caused by benchmarks that reward confident guessing over honest abstention. Rates keep falling with better data, grounding, and calibration-aware training, but a complete fix isn't expected — manage it, don't wait for it to disappear.
What's the difference between a hallucination and a knowledge cutoff?
A knowledge cutoff means the model simply hasn't seen recent information — that's ignorance, not hallucination. Hallucination is when the model fabricates a confident answer anyway, often to fill that exact gap. The fixes differ: for stale knowledge, give the model fresh data (search/RAG); for hallucination on rare facts, ground every claim in a verifiable source.
Which AI models hallucinate the least in 2026?
Independent 2026 benchmarks vary a lot, but the Claude (Opus/Sonnet) line, GPT-5.x, and Gemini 3 are all frontier-class, with Claude models frequently ranking among the lowest hallucination rates on factual queries. Treat any single percentage as benchmark-specific — the bigger win comes from grounding the model with RAG and tools regardless of which one you pick.