In plain English
Once you build something on top of a language model, you face one nagging question: is it any good? "Good" is slippery for an LLM — the same prompt can produce a great answer today and a clumsy one tomorrow, and small changes to your prompt or model can break things you never thought to check. Evaluation ("eval" for short) is how you turn that vague worry into a number you can track.

There are two fundamentally different moments to measure quality, and they answer two different questions:
- Offline evaluation runs before you ship. You take a fixed set of test cases you prepared in advance, run your system over them, and score the results. Nothing real is at stake — it's a rehearsal. It answers: should I release this change at all?
- Online evaluation runs after you ship, on real users. You watch live traffic, sample real answers, collect feedback, and score what actually happened. It answers: is this working for real people, right now?
An everyday analogy: think of a new airline pilot. Offline eval is the flight simulator — a controlled rehearsal on scenarios you chose, where a crash costs nothing and you can repeat it a thousand times. Online eval is the cockpit instruments on a real flight — altitude, fuel, engine temperature — telling you how the actual journey, with real weather and real passengers, is going. No airline would skip the simulator before letting a pilot fly. And none would fly blind once airborne. You need both, for different reasons.
Why it matters
If you only had to test ordinary software, one kind of testing would do. You write a unit test, it passes or fails, and a passing test today passes tomorrow. LLM systems break that comfortable rule in two ways, and each broken rule forces one of the two eval types into existence.
Why you can't ship without offline eval
Every change to an LLM app is a leap of faith unless you measure it. You swap claude-sonnet-4-6 for a newer model to save money — does answer quality hold? You tweak three words in a system prompt to fix one bug — did you quietly break five other things? You'd never deploy a database migration without running it against test data first. Offline eval is that test run: a repeatable, automatic check you can put in CI so a quality regression blocks the merge, exactly like a failing unit test. Without it, every release is a coin flip you watch land in production.
Why offline eval is never enough
Here's the catch that surprises new teams: your test set is a guess about what users will ask, and real users never read your guess. They paste in formats you didn't anticipate, ask in languages you didn't test, chain follow-up questions, and probe edge cases you couldn't imagine. Your offline scores can be excellent while real-world quality quietly sags — because the real query distribution doesn't match your dataset. The only way to know how the system behaves on real traffic is to measure real traffic. That's online eval, and it catches the failures your simulator never dreamed up.
So the two exist because they cover each other's blind spots. Offline can gate a release but can't see reality; online sees reality but can't gate a release (the users already got the bad answer). Rely on offline alone and you're flying blind once airborne; rely on online alone and you've skipped the simulator.
How it works
Both kinds of eval share the same skeleton: get some inputs, run your system, score the outputs. What differs is where the inputs come from, when it runs, and how you score. Walk the two pipelines side by side and the difference becomes concrete.
The offline pipeline: a gate before deploy
Offline eval starts with a golden dataset — a curated list of representative inputs, often with an expected answer or a quality rubric attached. You assemble it from real past queries, known tricky cases, and bugs you've fixed before (so they can never silently return). On every change, your CI runs the whole system over this fixed set and scores each output, then compares the average against the last release. If the score drops past a threshold, the build fails and the change doesn't ship.
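
The gate described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the dataset contents, the exact-match scorer, the `max_drop` threshold, and the `fake_system` stand-in are all assumptions made for the example.

```python
from statistics import mean
from typing import Callable

# Hypothetical golden dataset: each case pairs an input with an expected answer.
GOLDEN_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]


def score_case(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_offline_eval(system: Callable[[str], str], dataset: list) -> float:
    """Run the system over every golden case and return the average score."""
    return mean(score_case(system(case["input"]), case["expected"]) for case in dataset)


def gate(candidate_score: float, baseline_score: float, max_drop: float = 0.02) -> bool:
    """Pass only if the candidate is no more than max_drop below the baseline."""
    return candidate_score >= baseline_score - max_drop


def fake_system(prompt: str) -> str:
    """Stand-in for the real application (assumed: a function from prompt to text)."""
    answers = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "6"}
    return answers.get(prompt, "")
```

In CI, a small script would compute `run_offline_eval(...)` for the candidate, call `gate(...)` against the score recorded for the last release, and exit non-zero when it returns `False`. That non-zero exit is what blocks the merge.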
The online pipeline: a monitor after deploy
Online eval has no prepared dataset — the inputs are live user traffic. You log real requests and responses, then continuously sample a slice of them and score it. The scores come from signals you can't get offline: explicit user feedback (thumbs up/down), implicit signals (did the user retry, rephrase, or abandon?), and automated judges run on sampled live answers. You watch these as a dashboard over time and alert when quality drifts. This is the eval side of LLM observability.
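
A rough sketch of the sampling and alerting step, under stated assumptions: the `LoggedInteraction` fields, the sampling rate, and the alert thresholds are hypothetical, and a real system would read from its own log store rather than an in-memory list.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoggedInteraction:
    """One logged request/response pair (field names are illustrative)."""
    query: str
    response: str
    thumbs_up: Optional[bool] = None  # explicit feedback; None means no click
    user_retried: bool = False        # implicit signal: retry or rephrase


def sample_traffic(logs, rate, seed=None):
    """Keep each logged interaction independently with probability `rate`."""
    rng = random.Random(seed)
    return [item for item in logs if rng.random() < rate]


def quality_signals(sample):
    """Summarise explicit and implicit feedback over a sample of traffic."""
    rated = [item for item in sample if item.thumbs_up is not None]
    downs = sum(1 for item in rated if not item.thumbs_up)
    retries = sum(1 for item in sample if item.user_retried)
    return {
        "sampled": len(sample),
        "thumbs_down_rate": downs / len(rated) if rated else None,
        "retry_rate": retries / len(sample) if sample else None,
    }


def should_alert(signals, max_thumbs_down=0.2, max_retry=0.3):
    """Alert when either rate exceeds its (assumed) threshold."""
    down = signals["thumbs_down_rate"]
    retry = signals["retry_rate"]
    return (down is not None and down > max_thumbs_down) or (
        retry is not None and retry > max_retry
    )
```

The thumbs-down rate is computed only over interactions that received feedback, so it should be read alongside the retry rate, which covers every sampled interaction.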
How you actually score (both pipelines)
Scoring an LLM output is harder than checking result == 42, because most answers are open-ended text with many valid forms. Three common methods, in rough order of cost:
- Exact / rule-based checks — for outputs with a right answer (a number, a JSON shape, a classification label). Cheap, fast, fully objective. Works offline; works online too where the task has a checkable answer.
- Reference-based metrics — compare the output against a known-good reference answer. Needs labelled data, so it mostly lives offline. Useful when you can write the ideal answer in advance.
- LLM-as-judge — ask a separate model to grade the answer against a rubric ("is this faithful to the source? is it on-topic? is the tone right?"). Flexible enough to score open-ended text, so it powers both offline scoring and online sampling. The trade-off: a judge is itself an LLM, so you have to validate that its scores agree with human judgment.
Online adds one method offline can't have: real human signal. A thumbs-down from an actual user, or a user who immediately rephrases the question, is ground truth about whether the answer worked — data no simulator can fake.
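
The three automated methods can each be sketched briefly. These are illustrative assumptions, not a standard: token-overlap F1 is just one possible reference-based metric, and `judge` is a hypothetical callable that wraps whichever grader model you use and returns its text reply.

```python
import json
from collections import Counter
from typing import Callable


def has_json_keys(output: str, required_keys: set) -> bool:
    """Rule-based check: output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def token_f1(output: str, reference: str) -> float:
    """Reference-based metric: F1 over whitespace tokens shared with the reference."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def judge_score(question: str, answer: str, rubric: str,
                judge: Callable[[str], str]) -> int:
    """LLM-as-judge: ask a grader for a 1-5 score and parse the first valid digit."""
    prompt = (
        f"Rubric: {rubric}\nQuestion: {question}\nAnswer: {answer}\n"
        "Reply with a single score from 1 to 5."
    )
    reply = judge(prompt)
    for ch in reply:
        if ch in "12345":
            return int(ch)
    raise ValueError(f"judge reply contained no score: {reply!r}")
```

Passing the judge in as a callable keeps the scoring logic testable with a stub and independent of any particular model provider.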
Offline vs online, side by side
It helps to see the two laid against each other on every axis that matters. The short version: offline is controlled, repeatable, and cheap but artificial; online is real, messy, and slow but truthful.
| | Offline eval | Online eval |
|---|---|---|
| When it runs | Before release, in CI | After release, on live traffic |
| Inputs | Fixed golden dataset you curated | Real user queries as they arrive |
| Main question | Should I ship this change? | Is it working for real people now? |
| Scoring | Expected answers, rules, LLM-judge | User feedback, implicit signals, sampled judge |
| Speed | Fast — run anytime, repeatedly | Slow — must wait for real traffic |
| Cost / risk | Cheap, zero user impact | Users see the output before you score it |
| Can gate a release? | Yes — it's the gate | No — answer already shipped |
| Blind spot | Misses the real query distribution | Can't prevent a bad release |
Read the last two rows together and the whole relationship clicks: each one's strength is the other's weakness. Offline gates but can't see reality; online sees reality but can't gate. That's precisely why a serious system runs both, not one.
A worked example: upgrading the model
Suppose you run a support bot and want to move to a newer, cheaper model. Watch how the two evals divide the work across the lifecycle of that one change.
- Offline gate. You run your 300-case golden dataset against both the old and new model and compare scores. The new model matches on accuracy and improves on cost, so it passes the gate and is cleared to ship. Offline has done its one job: deciding, before any user is affected, whether the change is safe to release.
- Careful rollout. You don't flip everyone over at once. You send the new model a small slice of live traffic and run A/B testing, or route a copy of real queries to it silently with shadow mode so users never see its output while you score it.
- Online monitor. On live traffic you discover something offline missed entirely: users paste long error logs your golden set never contained, and the new model truncates them. Thumbs-down rates tick up on exactly those queries. Online eval caught a real-distribution failure your simulator couldn't have known to test.
- Close the loop. You add a few of those error-log cases to the golden dataset, fix the prompt, re-run the offline gate, and roll forward. Your simulator is now smarter because reality taught it. See model upgrade rollout for the full playbook.
Neither eval alone would have caught the whole story. Offline cleared a change that looked safe; online revealed the part offline couldn't see; and the two together turned a risky upgrade into a controlled one.
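
For the offline step of this example, it can help to compare the two models case by case rather than only on the average, since two runs can have the same mean while individual cases flip from passing to failing. A minimal sketch, assuming each run is stored as a hypothetical mapping from case id to score:

```python
def compare_runs(old_scores: dict, new_scores: dict, tolerance: float = 0.0) -> dict:
    """Compare per-case scores from two runs over the same golden dataset."""
    shared = sorted(old_scores.keys() & new_scores.keys())
    if not shared:
        raise ValueError("the two runs share no case ids")
    # A case regresses when the new score is lower than the old by more than tolerance.
    regressions = [
        cid for cid in shared if new_scores[cid] < old_scores[cid] - tolerance
    ]
    return {
        "old_mean": sum(old_scores[cid] for cid in shared) / len(shared),
        "new_mean": sum(new_scores[cid] for cid in shared) / len(shared),
        "regressions": regressions,
    }
```

The `regressions` list is a natural place to start reading outputs by hand before clearing the new model to ship.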
Common pitfalls
Both eval types are easy to set up badly in ways that give you false confidence — the worst outcome, because a misleading score is more dangerous than no score.
- A stale or tiny golden dataset. Fifteen cherry-picked examples that all pass tell you nothing. A golden set must be representative and must grow as you learn what real users do — otherwise offline scores stay green while reality drifts away.
- Treating offline scores as the truth. "95% on our eval set" is 95% on your guesses about reality, not reality. Never let a strong offline number talk you out of monitoring live traffic.
- Trusting an unvalidated judge. An LLM-as-judge can be confidently wrong or biased toward longer/flattering answers. Spot-check its scores against human ratings before you rely on it to gate releases.
- No online eval at all. Logging raw requests is not evaluation — it's storage. If nobody scores or alerts on the sampled traffic, you'll only learn about a regression when an angry user files a ticket.
- Sampling bias online. If mostly frustrated users bother to leave feedback, your signal skews negative; if you only sample easy queries, it skews positive. Sample deliberately and combine explicit feedback with implicit signals.
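
Spot-checking a judge against human ratings, as the third pitfall recommends, can be made quantitative. One common option is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes both sides assign categorical labels such as `"pass"` and `"fail"` to the same cases; what counts as "good enough" agreement is a judgment call it does not settle.

```python
from collections import Counter


def agreement_rate(judge_labels: list, human_labels: list) -> float:
    """Fraction of cases where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels: list, human_labels: list) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    observed = agreement_rate(judge_labels, human_labels)
    n = len(judge_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```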
Going deeper
Once the offline-gate / online-monitor split is second nature, a few finer points separate a basic setup from a robust one.
The line between them blurs in practice. Shadow mode and canary releases are deliberate hybrids: they run a candidate on real traffic (online's realism) but withhold its output from users or limit blast radius (offline's safety). Think of eval as a spectrum from "fully simulated, zero risk" to "fully live, full risk," with these techniques sitting in the safe middle. You move a change rightward along that spectrum as your confidence grows.
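
Shadow mode in particular is simple to sketch. In this hedged example, `primary` and `candidate` are hypothetical callables wrapping the current and the new model, and `shadow_log` is any list-like store you score later. The candidate is called inline for clarity; a real deployment would more likely run it asynchronously so it adds no latency for the user.

```python
from typing import Callable


def handle_request(query: str,
                   primary: Callable[[str], str],
                   candidate: Callable[[str], str],
                   shadow_log: list) -> str:
    """Serve the primary's answer; record the candidate's answer without showing it."""
    response = primary(query)
    try:
        shadow_output = candidate(query)
        shadow_log.append({"query": query, "primary": response, "shadow": shadow_output})
    except Exception as exc:
        # A failing candidate must never affect what the user receives.
        shadow_log.append({"query": query, "primary": response, "shadow_error": repr(exc)})
    return response
```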
Offline eval is only as honest as its data is free of leakage. If you build your golden set from the same examples you used to write the prompt, you are grading the system on answers it has already seen: scores look great and generalize poorly. Keep a held-out slice the system has never been tuned against. This is the same discipline that separates a training set from a test set in classic machine learning, and it is related to the training-vs-inference split between building and running a system.
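
One way to keep a held-out slice stable over time is to assign each case by hashing its id, so a case never migrates between the tuning slice and the held-out slice as the dataset grows. This is a sketch under assumptions: it presumes every case carries a stable `"id"` field, and the 20% fraction is arbitrary.

```python
import hashlib


def is_held_out(case_id: str, held_out_fraction: float = 0.2) -> bool:
    """Deterministically map a case id to a bucket in [0, 1) and compare."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < held_out_fraction


def split_golden_set(cases: list, held_out_fraction: float = 0.2):
    """Return (tuning_cases, held_out_cases); the same id always lands in the same slice."""
    tuning, held_out = [], []
    for case in cases:
        target = held_out if is_held_out(case["id"], held_out_fraction) else tuning
        target.append(case)
    return tuning, held_out
```

Prompt iteration would then use only the tuning slice, with the held-out slice reserved for the final check before release.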
Online eval has a feedback-latency problem. For a chatbot, you learn quality in seconds. For a system whose output is acted on later — a generated contract, a code suggestion, a research summary — the true "was this good?" signal can arrive days later, or never. Designing proxy signals (did the user accept the suggestion? edit it heavily? escalate to a human?) is much of the real craft of online eval.
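
A proxy signal of the kind described can be as simple as classifying what the user eventually did with a suggestion. The sketch below is one hypothetical scheme: the category names and the `heavy_edit_below` threshold are assumptions, and similarity is measured with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from typing import Optional


def proxy_signal(suggested: str,
                 final: Optional[str],
                 escalated: bool = False,
                 heavy_edit_below: float = 0.6) -> str:
    """Classify the outcome of a suggestion from what the user ended up keeping."""
    if escalated:
        return "escalated"          # user handed off to a human
    if final is None:
        return "no_signal_yet"      # outcome not known; may never arrive
    if final == suggested:
        return "accepted"
    similarity = SequenceMatcher(None, suggested, final).ratio()
    return "heavily_edited" if similarity < heavy_edit_below else "lightly_edited"
```

Tracking the share of each category over time gives a dashboard metric even when the true quality signal is days away.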
Where this fits in the bigger picture. Offline and online eval together are the measurement layer of LLMOps — the practice of running LLM systems in production. Next steps worth reading: the general discipline of testing LLM apps, A/B testing for comparing variants on live traffic, and shadow-mode testing for risk-free online trials. The durable lesson: offline tells you whether to let go of the rope, online tells you whether you're still on the wall — and you climb safely only when you have both.
FAQ
What is the difference between offline and online evaluation of an LLM?
Offline evaluation runs before release on a fixed, curated dataset of test cases — it's a rehearsal that gates whether a change ships. Online evaluation runs after release on real user traffic, scoring live answers from feedback and sampling to tell you how the system performs for actual users. Offline asks "should I ship this?"; online asks "is it working now?"
Can offline evaluation replace online evaluation?
No. Your offline dataset is a guess about what users will ask, and real users always send queries you didn't anticipate — different formats, languages, and edge cases. Offline scores can stay high while real-world quality drops, so you still need online eval to measure the real query distribution. They cover each other's blind spots.
What is a golden dataset in LLM evaluation?
A golden dataset is a curated set of representative test inputs, usually paired with an expected answer or a scoring rubric, that you run your system against during offline evaluation. You build it from real past queries, known tricky cases, and previously fixed bugs, and you grow it over time as online eval surfaces new real-world failures.
How do you score open-ended LLM answers that have no single right answer?
Three common methods: rule-based checks for outputs with a checkable answer (a number or JSON shape), reference-based comparison against a known-good answer, and LLM-as-judge, where a separate model grades the output against a rubric. Online eval adds real human signal too — thumbs up/down and whether the user retried or abandoned.
Where do offline evals run in the development workflow?
Offline evals typically run in continuous integration (CI), automatically, on every change. The system runs over the golden dataset, scores are compared against the previous release, and if quality drops past a threshold the build fails and the change is blocked — exactly like a failing unit test gates a merge.
Is shadow mode offline or online evaluation?
Shadow mode is a hybrid that leans online: it runs a candidate model or prompt on real live traffic (online realism) but hides its output from users so nothing is at stake (offline safety). It sits in the safe middle of the spectrum between a fully simulated test and a fully live release.