AI/TLDR

Detecting Hallucinations in Production: Practical Signals and Checks

You'll understand the practical, runtime techniques teams use to catch likely hallucinations before they reach the user, and how to react when one is detected.

INTERMEDIATE · 11 MIN READ · UPDATED 2026-06-13

In plain English

A large language model will happily produce a fluent, confident answer even when it has no idea what it's talking about. When that answer is factually wrong but presented as true, we call it a hallucination. The model invents a citation, a price, an API method, a refund window — and it never warns you, because to the model a made-up sentence and a true one look identical.

[Figure: Hallucination detection illustration. Source: comet.com]

Hallucination detection in production means catching those likely-wrong answers at request time — in the few hundred milliseconds between the model finishing its reply and your app showing it to a user — so you can do something about it before harm is done. This is a runtime guardrail, not an offline test. You are not measuring overall quality on a benchmark; you are deciding, live, for this one answer, whether to trust it.

Think of a sharp intern who writes beautifully but sometimes makes things up. You can't fire them — they're too useful — so you put a quick reviewer next to them. Before any answer goes to the customer, the reviewer asks three cheap questions: Does this match the source documents we gave you? Did you say the same thing twice when I asked twice? Are you actually sure, or just fluent? If the answer fails, the reviewer holds it back. Hallucination detection is that reviewer, automated.

Why it matters

A hallucination that escapes to a user is not a minor quality dip. It is one of the failure modes most likely to destroy trust in an AI product, trigger a support ticket, or create real liability.

  • The model gives no warning. Unlike a crash or a timeout, a hallucination returns a perfectly-formed, high-confidence response. Nothing in the HTTP status, latency, or token count tells you it's wrong. If you don't actively check, it ships.
  • Offline eval can't catch this request. You may know your system hallucinates 4% of the time. That statistic does nothing for the specific user staring at the specific wrong answer right now. Only a runtime check can intervene on this response.
  • The cost is asymmetric. In support, legal, medical, or financial tools, one confident fabrication — a wrong dosage, a non-existent legal clause, a made-up policy — outweighs a hundred correct answers. The downside is shaped like a long tail of rare, expensive mistakes.
  • Users over-trust fluency. People read a confident, well-written paragraph as authoritative. The better your model writes, the more dangerous an unflagged hallucination becomes.

Who needs this? Anyone shipping LLM output to users who will act on it. The detection layer is what lets you say "we don't know" instead of guessing, attach a confidence signal, or quietly route a shaky answer to a human. It is a core piece of any LLM guardrails stack and a defining step in moving an LLM prototype to production, where "it usually works in the demo" stops being good enough.

How it works

There is no single hallucination detector. In practice you layer several imperfect signals and combine their verdicts, because each one catches a different kind of error and none catches all. The pipeline sits between the model's raw output and the user, and ends in a decision: pass, abstain, disclaim, retry, or escalate.

1. Grounding (faithfulness) checks

If your answer is built on retrieved context — the normal case in RAG — the most powerful check is simple: is every claim in the answer actually supported by the retrieved passages? This is a grounding or faithfulness check. You break the answer into individual claims and verify each one against the context. A claim that appears nowhere in the sources is an unsupported claim, and unsupported claims are where hallucinations hide.

The simplest version is an LLM-as-judge pass: hand a second model the retrieved context and the draft answer and ask it to flag any sentence not entailed by the context. It's the same idea as a natural language inference check: does the source entail this claim, contradict it, or neither? Both "neither" and "contradict" mean trouble.

2. Self-consistency sampling

When there is no source to check against (open-ended generation), you can exploit a known property: models tend to be consistent about facts they actually know and erratic about facts they're inventing. So ask the same question two or three times (using a non-zero temperature so answers can vary) and compare. If the key facts agree across samples, confidence rises; if each run gives a different name, number, or date, that disagreement is a strong hallucination signal.
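
The comparison step can be sketched in a few lines. This is a minimal illustration, not a production scorer: it assumes you have already collected the sampled answers as strings, and the names consistency_score and first_number are hypothetical helpers invented for this example. The "key fact" here is simply the first number in each answer; a real system would extract names, dates, or entities with something sturdier.

```python
import re
from collections import Counter
from typing import Callable, List, Optional


def first_number(text: str) -> Optional[str]:
    """Illustrative extractor: the first number in the answer, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return match.group(0) if match else None


def consistency_score(samples: List[str],
                      extract: Callable[[str], Optional[str]]) -> float:
    """Fraction of samples that agree on the most common extracted fact.

    Samples where nothing could be extracted count as disagreement.
    """
    facts = [f for f in (extract(s) for s in samples) if f is not None]
    if not facts:
        return 0.0
    _, top_count = Counter(facts).most_common(1)[0]
    return top_count / len(samples)


stable = [
    "The refund window is 30 days.",
    "You have 30 days to request a refund.",
    "Refunds are accepted within 30 days.",
]
erratic = [
    "It was founded in 1987.",
    "The company started in 1992.",
    "Founded 2001.",
]
stable_score = consistency_score(stable, first_number)
erratic_score = consistency_score(erratic, first_number)
```

A score near 1.0 means the samples agree; a score near 1/len(samples) means every run said something different. Where you draw the line between the two is a product decision, not something this sketch settles.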

3. Confidence and weak signals

Some models expose token log-probabilities — how probable the model thought each generated token was. Stretches of low-probability tokens often coincide with fabricated spans. It's a noisy signal, not a verdict, but cheap to collect and useful as one input among several. You can also simply ask the model to rate its own confidence, though self-reported confidence is famously unreliable on its own.
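
One way to turn per-token log-probabilities into a single weak signal is to look for the least confident stretch rather than the overall average, since a fabricated span can hide inside an otherwise confident answer. The sketch below assumes you already have the log-probabilities as a plain list of floats (how you obtain them depends on your provider); the function name and the window size of 4 are illustrative choices, not a standard.

```python
import math
from typing import List


def lowest_window_prob(token_logprobs: List[float], window: int = 4) -> float:
    """Geometric-mean probability of the least likely run of `window` tokens.

    Returns 0.0 for an empty input: no tokens means no evidence of confidence.
    """
    if not token_logprobs:
        return 0.0
    window = min(window, len(token_logprobs))
    window_means = [
        sum(token_logprobs[i:i + window]) / window
        for i in range(len(token_logprobs) - window + 1)
    ]
    # exp(mean log-prob) is the geometric mean of the token probabilities.
    return math.exp(min(window_means))


confident = [math.log(0.9)] * 14
shaky_span = [math.log(0.9)] * 5 + [math.log(0.1)] * 4 + [math.log(0.9)] * 5
confident_signal = lowest_window_prob(confident)
shaky_signal = lowest_window_prob(shaky_span)
```

The second input has the same confident tokens at both ends, but the four low-probability tokens in the middle pull the signal down sharply. Treat the result as an input to a later decision, not as a verdict.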

The practical pattern is a funnel: run the near-free confidence signal on every request, and only spend money on the expensive grounding or self-consistency check when the cheap signal looks shaky or the request is high-stakes. That keeps average latency and cost low while still covering the dangerous tail.
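
The funnel can be sketched as a small gate in front of the expensive check. Everything named here is hypothetical: run_funnel, the 0.5 threshold, and the deep_check callable (which stands in for whatever grounding or self-consistency check you run, returning True when the answer looks fine). The point is only that the expensive call is skipped when the cheap signal is healthy and the stakes are low.

```python
from typing import Callable


def needs_deep_check(confidence: float, high_stakes: bool,
                     threshold: float = 0.5) -> bool:
    """Escalate to the expensive check on shaky confidence or high stakes."""
    return high_stakes or confidence < threshold


def run_funnel(confidence: float, high_stakes: bool,
               deep_check: Callable[[], bool],
               threshold: float = 0.5) -> str:
    """Return "pass" or "flag"; only call deep_check when the gate opens."""
    if not needs_deep_check(confidence, high_stakes, threshold):
        return "pass"
    return "pass" if deep_check() else "flag"


# Stand-in for an expensive check that always fails, so we can see the gating.
deep_calls = []


def failing_deep_check() -> bool:
    deep_calls.append(1)
    return False


low_risk = run_funnel(0.9, False, failing_deep_check)    # gate stays closed
shaky = run_funnel(0.2, False, failing_deep_check)       # low confidence
critical = run_funnel(0.9, True, failing_deep_check)     # high stakes
```

Passing the expensive check in as a callable keeps it lazy: nothing is spent unless the gate decides it is worth it.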

A worked grounding check

Here is the core of a runtime grounding check for a RAG answer. We extract the claims, ask a judge model whether the retrieved context supports each one, and turn the result into a decision. This is deliberately small — the shape matters more than the lines.

grounding_check.py (Python)
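from anthropic import Anthropic
import json

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE = """You are a strict fact-checker. Given CONTEXT and an ANSWER,
list every factual claim in the ANSWER and mark each as:
  "supported"    - directly stated by the CONTEXT
  "unsupported"  - not found in the CONTEXT
  "contradicted" - the CONTEXT says otherwise
Reply ONLY with JSON: {"claims": [{"text": "...", "verdict": "..."}]}"""

def grounding_verdict(context: str, answer: str) -> dict:
    """Ask the judge model to label each claim in `answer` against `context`."""
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=600,
        system=JUDGE,
        messages=[{"role": "user",
                   "content": f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"}],
    )
    try:
        claims = json.loads(msg.content[0].text)["claims"]
    except (json.JSONDecodeError, KeyError, TypeError, IndexError, AttributeError):
        # Fail closed: an unparseable judge reply is treated as not grounded.
        return {"grounded": False, "problem_claims": []}
    # Anything other than "supported" (unsupported or contradicted) is a problem.
    bad = [c for c in claims if c.get("verdict") != "supported"]
    return {"grounded": len(bad) == 0, "problem_claims": bad}

# retrieved_context and draft_answer come from your RAG pipeline;
# abstain_or_escalate is your own handler for flagged answers.
verdict = grounding_verdict(retrieved_context, draft_answer)
if not verdict["grounded"]:
    answer = abstain_or_escalate(verdict["problem_claims"])  # don't ship as-is
else:
    answer = draft_answer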

Notice what the check does not do: it doesn't decide if the answer is good, only if it's grounded in the provided sources. That's the right scope for a runtime guardrail. Quality is judged offline; faithfulness is judged live.

What to do when a hallucination is flagged

Detection is only half the job. The product decision (what happens to a flagged answer) is what users actually feel. There are four standard responses, and the right one depends on the stakes of the request and on whether the failure looks fixable.

Response | What the user sees | When to use it
--- | --- | ---
Abstain | "I don't have a confident answer for that." | High-stakes domains where a wrong answer is worse than no answer
Disclaim | The answer, plus "please verify; I'm not certain." | Medium-stakes; the answer is probably useful but shaky
Retry / repair | A fresh answer after re-retrieving or re-prompting | When the failure looks fixable (bad chunks, missing context)
Escalate to a human | A handoff, or a queued reply from an agent | Critical paths: legal, medical, billing, account actions

The deeper principle: abstaining is a feature, not a failure. A system that knows when to say "I'm not sure" is more trustworthy than one that always answers. Wiring these paths in is part of how you handle LLM failures gracefully rather than letting every uncertain answer leak through.
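
Wiring those paths in usually comes down to one small routing function. The sketch below is one possible policy, not the only reasonable one: the stakes labels ("critical", "high", "medium"), the retry budget, and the precedence (critical paths always escalate, fixable failures get a retry before anything else) are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"
    ABSTAIN = "abstain"
    DISCLAIM = "disclaim"
    RETRY = "retry"
    ESCALATE = "escalate"


def choose_action(flagged: bool, stakes: str,
                  fixable: bool, retries_left: int) -> Action:
    """Map a detector flag plus request context to one of the responses."""
    if not flagged:
        return Action.PASS
    if stakes == "critical":
        return Action.ESCALATE      # legal, medical, billing, account actions
    if fixable and retries_left > 0:
        return Action.RETRY         # bad chunks or missing context
    if stakes == "high":
        return Action.ABSTAIN       # wrong answer is worse than no answer
    return Action.DISCLAIM          # probably useful but shaky


billing_case = choose_action(True, "critical", fixable=True, retries_left=1)
bad_chunks_case = choose_action(True, "high", fixable=True, retries_left=1)
out_of_retries = choose_action(True, "high", fixable=True, retries_left=0)
casual_case = choose_action(True, "medium", fixable=False, retries_left=0)
```

Capping retries matters: without a budget, a question the system genuinely cannot answer would loop instead of falling through to abstain or disclaim.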

Common pitfalls

  • Treating the judge as ground truth. Your LLM-as-judge can hallucinate too. Keep it narrow (one claim at a time, structured output), and remember it has a false-positive and a false-negative rate you should measure.
  • Ignoring the latency budget. A grounding check is a whole extra model call. Running the expensive detector on 100% of traffic can double your latency and cost. Gate it behind a cheap pre-filter and reserve it for risky or high-stakes requests.
  • One threshold for everything. The bar for a casual chatbot and for a medical lookup is not the same. Tie your abstain/escalate threshold to the stakes of the request, not a single global number.
  • Confusing confidence with correctness. Low token probability suggests uncertainty, not falsehood — and a model can be fluently, confidently wrong. Use confidence as one weak input, never as the sole gate.
  • No feedback loop. If you never compare flags against real outcomes (user reports, human review), you can't tell whether your detector is catching real hallucinations or just annoying users with false alarms.
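
The feedback loop in the last pitfall can start very small. Assuming you log, for each reviewed request, whether the detector flagged it and whether a human or a user report later confirmed a hallucination, the two numbers worth tracking are precision (how many flags were real) and recall (how many real hallucinations were flagged). The record shape and the function name below are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def detector_quality(
    records: Iterable[Tuple[bool, bool]],
) -> Dict[str, Optional[float]]:
    """Compute precision and recall from (flagged, was_hallucination) pairs.

    A metric is None when its denominator is zero (nothing to measure yet).
    """
    tp = fp = fn = 0
    for flagged, was_hallucination in records:
        if flagged and was_hallucination:
            tp += 1
        elif flagged and not was_hallucination:
            fp += 1     # false alarm: a good answer was held back
        elif was_hallucination:
            fn += 1     # miss: a hallucination reached the user
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"precision": precision, "recall": recall}


reviewed = [
    (True, True),    # flagged, confirmed hallucination
    (True, False),   # flagged, answer was actually fine
    (False, True),   # not flagged, hallucination slipped through
    (False, False),  # not flagged, answer was fine
    (True, True),
]
quality = detector_quality(reviewed)
```

Low precision means you are annoying users with false alarms; low recall means the detector is missing what it exists to catch.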

Going deeper

Once the basic layered detector is running, the field opens up in a few directions worth knowing.

Claim-level verification with external tools. Beyond checking against your own retrieved context, you can verify high-stakes claims against an authoritative source at request time — re-querying your database, calling a search tool, or looking up a record. This turns detection into a small verification loop and overlaps with how agentic systems decide to double-check themselves. It is slower and pricier, so it's reserved for the claims that truly matter (a quoted price, a legal citation, an account balance).
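
For a quoted price, that verification loop might look like the sketch below. It is deliberately naive and rests on stated assumptions: the catalog mapping stands in for your authoritative database, the SKU is assumed to be known from the request, and the regex only handles simple dollar amounts such as $49 or $49.00 (no thousands separators, no other currencies). The function name is made up for this example.

```python
import re
from typing import Mapping, Optional


def verify_quoted_price(answer: str, sku: str,
                        catalog: Mapping[str, float]) -> Optional[bool]:
    """Check the first dollar amount in `answer` against the catalog price.

    Returns True or False when a comparison is possible, and None when the
    answer quotes no price or the SKU is unknown (nothing to verify).
    """
    match = re.search(r"\$(\d+(?:\.\d{1,2})?)", answer)
    if match is None or sku not in catalog:
        return None
    return abs(float(match.group(1)) - catalog[sku]) < 0.005


catalog = {"PRO-1": 49.0}  # hypothetical authoritative price list
correct = verify_quoted_price("The Pro plan costs $49.00 per month.", "PRO-1", catalog)
wrong = verify_quoted_price("The Pro plan costs $59.", "PRO-1", catalog)
no_price = verify_quoted_price("The Pro plan is billed monthly.", "PRO-1", catalog)
```

Returning None separately from False keeps "could not verify" distinct from "verified and wrong", which usually deserve different responses.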

Sampling-based methods. Research approaches like SelfCheckGPT formalize the self-consistency idea: generate several samples and score how much the facts in the main answer agree with the rest. The intuition is the same one above — disagreement across samples flags likely fabrication — but with a more principled aggregation than a simple vote.

Calibration and thresholds. The real engineering work is choosing where to draw the line. Every threshold trades false positives (blocking good answers, frustrating users) against false negatives (letting hallucinations through). You tune this on logged data, and you'll likely want different thresholds per route. This is where online detection rejoins offline measurement: you need a labeled set to know whether your live detector is actually calibrated.
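
Given a labeled set, choosing the line can be as simple as a sweep. The sketch below assumes each logged answer has a risk score from your detector (higher means more suspicious) and a human label, and that an answer is flagged when its score is at or above the threshold. It returns the highest threshold that still meets a target recall, which is the one that flags the fewest answers. The function name and the data are illustrative.

```python
from typing import List, Optional, Tuple


def pick_threshold(scored: List[Tuple[float, bool]],
                   min_recall: float) -> Optional[float]:
    """Highest threshold whose recall on labeled data is at least min_recall.

    `scored` holds (risk_score, is_hallucination) pairs. Returns None when
    the set contains no hallucinations, since recall is then undefined.
    """
    positives = sum(1 for _, bad in scored if bad)
    if positives == 0:
        return None
    # Try candidate thresholds from strictest (fewest flags) to loosest.
    for threshold in sorted({score for score, _ in scored}, reverse=True):
        caught = sum(1 for score, bad in scored if bad and score >= threshold)
        if caught / positives >= min_recall:
            return threshold
    return None


labeled = [
    (0.9, True), (0.8, False), (0.7, True),
    (0.4, False), (0.3, True), (0.1, False),
]
casual_route = pick_threshold(labeled, min_recall=0.66)
medical_route = pick_threshold(labeled, min_recall=1.0)
```

Running the sweep once per route, each with its own recall target, is one straightforward way to get different thresholds for a casual chatbot and a medical lookup from the same logged data.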

Where detection lives in the stack. In a mature setup these checks don't sit in your application code — they sit in a shared layer (an LLM gateway or a guardrails service) so every team gets the same protection and the same logs. From there it composes with the rest of your reliability tooling: output validation, content moderation, and failure handling.

The honest bottom line, unchanged for years: no runtime detector is perfect, because deciding whether a fluent sentence is true is, in general, as hard as the original task. The goal is not zero hallucinations reaching users — it's catching enough of the costly ones, cheaply enough to run on every request, and degrading gracefully when you're unsure. A system that knows the limits of its own knowledge beats one that always sounds certain.

FAQ

What is hallucination detection in an LLM application?

It's a runtime check that runs after the model generates an answer but before the user sees it, estimating whether that specific answer is likely fabricated. Common signals are grounding checks against retrieved context, self-consistency across multiple samples, and token-confidence scores. When an answer looks risky, the system can abstain, add a disclaimer, retry, or route to a human.

How do you detect hallucinations in a RAG system at runtime?

Run a grounding (faithfulness) check: break the answer into individual claims and verify each one against the retrieved passages, usually with an LLM-as-judge that marks claims as supported, unsupported, or contradicted. Any unsupported or contradicted claim flags the answer. This works because in RAG you always have the source text to check against.

What is a grounding check or faithfulness check?

It tests whether every claim in the answer is actually supported by the source documents you provided, rather than invented by the model. It's a yes/no judgment about faithfulness to the sources, not about whether the answer is good overall. It's the strongest hallucination signal available whenever your answer is built on retrieved context.

Can token confidence or log-probabilities catch hallucinations?

Partly. Low token probabilities often line up with fabricated spans, so they're a useful cheap signal — but they're noisy, because a model can be confidently wrong and uncertain about something true. Treat confidence as one weak input combined with grounding and consistency checks, never as the only gate.

What should an app do when it detects a likely hallucination?

Pick a response based on the stakes: abstain ("I'm not sure") for high-risk questions, add a disclaimer for medium-risk ones, retry after re-retrieving context if the cause looks fixable, or escalate to a human for critical paths like legal, medical, or billing. Always log the flag and the action so you can tune thresholds later.

Is hallucination detection the same as RAG evaluation?

No. RAG evaluation is offline — it measures how often your whole system hallucinates across a test set before you ship. Hallucination detection is online — it runs on every live request and drives a real-time decision about that one answer. They share faithfulness ideas but do different jobs, and you generally need both.
