In plain English
An LLM eval (short for evaluation) is a repeatable test for an AI's output. You give the model a set of inputs, collect what it says back, and score those answers against some definition of "good." Run it again next week, after changing your prompt or swapping the model, and you get a number you can compare. That comparable number is the whole point.
Think of it like a spell-checker for behavior instead of spelling. A spell-checker doesn't read your essay and feel that it's fine — it runs every word through a fixed rule and flags failures the same way every time. An eval does that for your AI: same inputs, same scoring, every run, so "better" and "worse" stop being opinions.
Here's the trap evals exist to kill. You build a chatbot, try five questions, the answers look great, you ship. A user asks a sixth question you never tried and the model confidently invents a refund policy that doesn't exist. You had no way to know, because "I tried a few and it looked good" is not a test — it's a vibe. Once you have more than a handful of cases, a human eyeballing every output simply can't keep up, and the cases you don't check are exactly where things break.
Why it matters
Traditional software is deterministic: 2 + 2 returns 4 today, tomorrow, and forever, so a unit test with one assertion is enough. LLMs are not like that. The same prompt can return different wording each time, a vendor can quietly update the model under your API, and a one-word change to your prompt can silently wreck answers for an entire category of questions. You can't assert output == "4" when there are a thousand valid ways to phrase a correct answer.
Evals solve the problem manual checking can't: scale and regression. With ten test cases a human can read them all. With three hundred, nobody will. And the failure mode isn't random — it's that you fix one bug and quietly break three others you weren't looking at. An eval suite re-checks all three hundred cases in minutes, so "I improved the summary prompt" comes with proof you didn't tank the question-answering.
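One way to make the "fix one bug, break three others" failure visible is to diff per-case results between two runs instead of comparing only the headline number. The sketch below is a minimal illustration; the case IDs and the pass/fail dictionaries are hypothetical, not output from a real run.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-case pass/fail results from two eval runs.

    Both arguments map a case ID to True (passed) or False (failed).
    Only cases present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
    }


# Hypothetical results before and after a change to the summary prompt.
before = {"summary-01": False, "summary-02": False, "qa-01": True, "qa-02": True}
after = {"summary-01": True, "summary-02": True, "qa-01": False, "qa-02": False}

report = find_regressions(before, after)
print(report)
```

In this made-up example both runs pass 2 of 4 cases, so the aggregate does not move, yet two question-answering cases regressed. That is why it helps to keep per-case results alongside the single tracked number.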
Who should care
- Anyone shipping an LLM feature — support bots, summarizers, RAG search, agents. Without evals you're flying blind on every prompt change.
- Teams comparing models — is the cheaper model good enough for your task? An eval gives a number instead of a hunch.
- Anyone doing prompt engineering — evals are how you tell a prompt tweak that helped from one that looked clever and hurt.
- Production owners — when a vendor updates the model behind your API, an eval is your early-warning alarm that behavior drifted.
What did evals replace? Honestly, no earlier tool. They replaced hope. The old workflow was "change the prompt, click around, ship if it feels right." That works for a demo and falls apart the moment real users send inputs you never imagined. Evals turn that gut-feel loop into prompt iteration you can measure.
How it works
Every eval, no matter how fancy, is the same four-part loop: a dataset of inputs (often with expected answers), the system under test that produces outputs, a scorer that grades each output, and an aggregate that rolls the scores into one number you track over time.
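The four parts can be sketched as a single function that takes the dataset, the system under test, and the scorer as arguments and returns the aggregate. This is a minimal sketch, not a framework: `run_eval`, `shout`, and the `input`/`expected` field names are assumptions made for illustration, and the "system" here is a trivial stand-in for a real model call.

```python
from typing import Callable

Case = dict[str, str]


def run_eval(
    dataset: list[Case],
    system: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> float:
    """Run every case through the system, score it, and return the mean score."""
    if not dataset:
        raise ValueError("dataset is empty")
    scores = []
    for case in dataset:
        output = system(case["input"])                    # system under test
        scores.append(scorer(output, case["expected"]))   # scorer
    return sum(scores) / len(scores)                      # aggregate


def shout(text: str) -> str:
    """Stand-in system under test: upper-cases its input."""
    return text.upper()


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


dataset = [
    {"input": "refund", "expected": "REFUND"},
    {"input": "bug", "expected": "BUG"},
    {"input": "hello", "expected": "other"},
]

score = run_eval(dataset, shout, exact_match)
print(f"Score: {score:.1%}")
```

Keeping the system and the scorer as plain callables means either one can be swapped without touching the loop.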
The interesting design choice is the scorer — how you decide an answer is good. There are three families, and good eval suites mix them:
| Scorer type | How it grades | Best for |
|---|---|---|
| Exact / rule-based | String match, regex, valid JSON, contains keyword | Classification, extraction, structured output |
| Reference-based | Compare to a known good answer (similarity, F1) | Translation, factual Q&A with a fixed answer |
| Model-graded | Another LLM scores the answer against a rubric | Open-ended writing, tone, helpfulness |
Rule-based scorers are cheap, instant, and deterministic, so always reach for them first. But "was this summary helpful and faithful?" has no regex. That's where model-graded scoring comes in: you ask a separate LLM to judge the output against a rubric. It's the workhorse for subjective tasks, and it has its own gotchas, so read "What Is LLM-as-a-Judge?" before you lean on it.
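To make the first two rows of the table concrete, here is a small sketch of three rule-based scorers and one reference-based scorer. The function names are illustrative, and the token-level F1 shown is one common way to compare against a reference answer (whitespace tokens, case-insensitive); it is an assumption about what "F1" means here, not the only definition.

```python
import json
import re
from collections import Counter


def exact_match(output: str, expected: str) -> float:
    """Rule-based: 1.0 if the strings match after trimming and lower-casing."""
    return float(output.strip().lower() == expected.strip().lower())


def matches_pattern(output: str, pattern: str) -> float:
    """Rule-based: 1.0 if the whole trimmed output matches the regex."""
    return float(re.fullmatch(pattern, output.strip()) is not None)


def is_valid_json(output: str) -> float:
    """Rule-based: 1.0 if the output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def token_f1(output: str, reference: str) -> float:
    """Reference-based: harmonic mean of token precision and recall."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        return float(out_tokens == ref_tokens)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match(" Billing ", "billing"))
print(is_valid_json("{a: 1}"))
print(round(token_f1("the capital is paris france", "paris"), 3))
```

Each scorer returns a float between 0 and 1, so they can all feed the same averaging step.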
Two more terms you'll see constantly. Offline evals run against a fixed dataset before you ship — your regression test suite. Online evals score real production traffic after you ship, where you usually have no "correct" answer to compare against, so you lean on model-graded checks and user signals. They feed each other: production failures become new offline test cases.
| | Offline evals | Online evals |
|---|---|---|
| Data | Fixed curated dataset | Real user traffic |
| Reference answers | Has expected answers | No reference answer |
| Where it runs | In CI, pre-ship | In production |
| What it catches | Regressions | Drift + edge cases |
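The idea that production failures become new offline test cases can be sketched in a few lines. This is a minimal, in-memory illustration: `add_failure_case` and `to_jsonl` are hypothetical helpers, the records use a `text`/`expected` shape, and storing the dataset as JSON Lines is an assumption, not a requirement.

```python
import json


def add_failure_case(dataset: list[dict], text: str, expected: str) -> list[dict]:
    """Return a new dataset with the failing input appended, skipping duplicates."""
    if any(case["text"] == text for case in dataset):
        return list(dataset)
    return [*dataset, {"text": text, "expected": expected}]


def to_jsonl(dataset: list[dict]) -> str:
    """Serialize the dataset as one JSON object per line."""
    return "\n".join(json.dumps(case) for case in dataset)


offline = [{"text": "You charged my card twice this month", "expected": "billing"}]

# A hypothetical input that went wrong in production, labeled by a human.
updated = add_failure_case(offline, "Can I get a refund after 90 days?", "billing")
again = add_failure_case(updated, "Can I get a refund after 90 days?", "billing")

print(to_jsonl(updated))
```

The expected label still has to come from a person who reviewed the failure; the code only makes sure the case is recorded once and replayed on every later run.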
Build your first eval
You don't need a framework to start. An eval is just a loop, a scorer, and a percentage. Here's a complete one for a classifier that should label support tickets as billing, bug, or other. The scorer is dead simple — exact match — because the output space is closed.
```python
from anthropic import Anthropic

client = Anthropic(api_key="sk-...")  # placeholder

# 1. The dataset: inputs + the label we expect.
DATASET = [
    {"text": "You charged my card twice this month", "expected": "billing"},
    {"text": "The app crashes when I tap export", "expected": "bug"},
    {"text": "What are your office hours?", "expected": "other"},
    # ...add 50+ real, messy examples here
]


def classify(text: str) -> str:
    """The system under test."""
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": "Label this ticket as exactly one of "
                       "billing, bug, or other. Reply with only the word.\n\n"
                       f"{text}",
        }],
    )
    # Normalize whitespace and case so "Billing\n" still matches "billing".
    return msg.content[0].text.strip().lower()


# 2. Run + 3. score each case.
results = []
for case in DATASET:
    got = classify(case["text"])
    passed = got == case["expected"]  # exact-match scorer
    results.append(passed)
    if not passed:
        print(f"FAIL expected={case['expected']!r} got={got!r} {case['text']!r}")

# 4. Aggregate into one tracked number.
accuracy = sum(results) / len(results)
print(f"\nAccuracy: {accuracy:.1%} ({sum(results)}/{len(results)})")
```

Run it and you get a line such as Accuracy: 88.0%, and now "better" has a definition. Change the prompt, run again, and the number tells you whether the change helped. Print every failure: reading the FAILs is where you learn what your prompt is actually getting wrong, which is the real payoff of an eval.
When exact match won't cut it (for example, "is this summary faithful to the source?"), you swap the passed = got == case["expected"] line for a model-graded scorer: a second LLM call that returns a 1 or 0 against a rubric. The loop, the dataset, and the aggregate stay identical. That swap-in is the entire trick. This pattern is also how you measure retrieval quality in RAG evaluation.
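A model-graded scorer can be sketched as a function that formats a rubric prompt, sends it to a judge model, and parses a 1 or 0 from the reply. In this sketch the judge is passed in as a plain callable named `call_llm`, which is a hypothetical stand-in for whatever client call you use; the rubric wording is also only an example. The stub judge below exists so the sketch runs without network access.

```python
from typing import Callable

RUBRIC_PROMPT = """You are grading a summary for faithfulness.

Source:
{source}

Summary:
{summary}

Reply with exactly one character: 1 if every claim in the summary is
supported by the source, 0 otherwise."""


def judge_faithfulness(source: str, summary: str, call_llm: Callable[[str], str]) -> bool:
    """Ask a judge model for a 1/0 verdict and return it as a bool."""
    reply = call_llm(RUBRIC_PROMPT.format(source=source, summary=summary)).strip()
    if reply not in {"0", "1"}:
        # Fail loudly instead of silently counting an unparseable reply.
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return reply == "1"


def stub_judge(prompt: str) -> str:
    """Hypothetical judge used only for this demo: always approves."""
    return "1"


passed = judge_faithfulness(
    source="Refunds are available within 30 days of purchase.",
    summary="You can get a refund within 30 days.",
    call_llm=stub_judge,
)
print(passed)
```

Raising on an unparseable reply is a design choice: it keeps judge formatting problems from being quietly scored as failures of the system under test.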
What to actually measure
The hardest part of evals isn't the code — it's deciding what good means for your task. "Accuracy" is meaningless until you define the criteria. A few that come up over and over:
- Correctness — is the answer factually right? The baseline for Q&A and extraction.
- Faithfulness / groundedness — does the answer stick to the provided source, or did the model hallucinate? Critical for RAG and summarization.
- Format / schema — is it valid JSON, the right length, the required fields present? A cheap rule-based check that catches a huge class of bugs.
- Relevance — did it answer the question that was asked, or wander off?
- Safety / refusal — does it refuse what it should and not over-refuse harmless requests?
- Tone & style — does it match your brand voice? Almost always model-graded.
Pick the two or three criteria that matter most for your use case and ignore the rest at first. A support bot lives or dies on correctness and faithfulness; a creative-writing tool cares about tone. Trying to measure everything on day one is the fastest way to never ship an eval at all.
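When you track more than one criterion, it helps to report a pass rate per criterion instead of folding everything into a single score. The sketch below assumes each case has already been graded into a dictionary of criterion name to pass/fail; the `pass_rates` helper and the sample results are hypothetical.

```python
from collections import defaultdict


def pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Compute the fraction of cases that passed, separately for each criterion."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for graded_case in results:
        for criterion, passed in graded_case.items():
            totals[criterion] += 1
            passes[criterion] += int(passed)
    return {criterion: passes[criterion] / totals[criterion] for criterion in totals}


# Hypothetical grades for four support-bot answers.
graded = [
    {"correctness": True, "faithfulness": True},
    {"correctness": True, "faithfulness": False},
    {"correctness": False, "faithfulness": False},
    {"correctness": True, "faithfulness": True},
]

rates = pass_rates(graded)
for criterion, rate in rates.items():
    print(f"{criterion}: {rate:.0%}")
```

Separate rates show which criterion is dragging quality down, which a blended number would hide.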
The tool landscape
You can write evals from scratch — the example above is the whole idea — but frameworks save you boilerplate, give you nice reports, and ship pre-built scorers for common criteria. The big names you'll run into:
| Tool | What it is | Good when |
|---|---|---|
| OpenAI Evals | Open-source eval framework + benchmark registry | You want a battle-tested harness and Python |
| DeepEval | pytest-style LLM testing with built-in metrics | You already think in unit tests |
| LangSmith | Hosted datasets, tracing, offline + online evals | You want a UI and production monitoring |
| Ragas | Metrics specialized for RAG pipelines | Your app retrieves before it answers |
| promptfoo | Config-driven eval + prompt comparison CLI | You want fast side-by-side prompt tests |
Don't agonize over the choice. The framework matters far less than having a dataset and running it regularly. Most teams start with a 50-line script like the one above, then graduate to a tool once they need shared datasets, dashboards, or observability on production traffic. The discipline of running the eval beats the sophistication of the harness every time.
Going deeper
Once your eval suite is part of daily life, a harder set of problems shows up. These are the things that separate a toy eval from one teams actually trust.
Grading the grader
If a model judges your outputs, who judges the judge? Model-graded scorers have real biases: they favor longer answers, they prefer outputs that match their own style, and they can be inconsistent run to run. The fix is to validate the judge against human labels: have people grade a sample, then check how often the LLM judge agrees. If agreement is low, your eval is measuring the judge's quirks, not your app's quality. This is the central pitfall covered in "What Is LLM-as-a-Judge?".
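Checking how often the judge agrees with people can be as simple as comparing two lists of labels. The sketch below computes the raw agreement rate and, as an addition not taken from the text above, Cohen's kappa, a common statistic that corrects agreement for chance. The labels are made up for illustration.

```python
def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Fraction of cases where the judge's label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Chance-corrected agreement for binary (0/1) labels."""
    observed = agreement_rate(human, judge)
    human_pos = sum(human) / len(human)
    judge_pos = sum(judge) / len(judge)
    # Probability that two independent raters with these rates would agree.
    expected = human_pos * judge_pos + (1 - human_pos) * (1 - judge_pos)
    if expected == 1.0:
        return 1.0  # both raters gave the same single label to every case
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail labels on the same ten outputs.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_labels = [1, 1, 0, 0, 1, 1, 0, 1, 1, 1]

agreement = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Raw agreement can look respectable when most labels are "pass" anyway; a chance-corrected figure is a stricter check. What counts as "high enough" depends on the task and is not settled by this sketch.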
Non-determinism and statistical noise
Run the same eval twice and the score can wobble by a point or two, because generation is sampled, not fixed (the temperature setting controls how much). So a prompt change that moves accuracy from 87% to 88% may be pure noise. Mitigations: set temperature to 0 for graded runs where it makes sense (this reduces run-to-run variation but does not guarantee identical outputs), use enough cases that small differences are meaningful, and treat tiny score changes with suspicion rather than celebration.
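A rough way to judge whether 87% versus 88% means anything is to put an interval around the accuracy. The sketch below uses the normal approximation to the binomial, which is an assumption that works reasonably for a few hundred cases and accuracies away from 0% or 100%; the 300-case dataset size is hypothetical.

```python
import math


def accuracy_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for an accuracy, using the normal approximation."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical run: 261 of 300 cases passed (87%).
low, high = accuracy_interval(261, 300)
print(f"87% on 300 cases: roughly {low:.1%} to {high:.1%}")

# A later run at 264 of 300 (88%) falls inside the same interval.
print(low < 264 / 300 < high)
```

With 300 cases the margin is close to four percentage points either way, so a one-point move is well within sampling noise. Shrinking the interval takes more cases, not more decimal places.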
Evaluating agents and multi-step traces
Scoring a single answer is easy. Scoring an agent that planned, called five tools, and looped is not — the final answer can be right while the path was wasteful or wrong, or the answer can be wrong because of one bad tool call early. Serious agent evals score trajectories: did it pick the right tool, in the right order, without burning a fortune in tokens? This is an active frontier with far fewer settled best practices than single-turn evals.
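A trajectory check can be sketched as a few rule-based tests over the recorded steps of a run. Everything here is a hypothetical illustration: the `Step` record, the tool names, and the three checks are assumptions about what a trace might contain, not a standard format.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str    # name of the tool the agent called
    tokens: int  # tokens spent on this step


def is_subsequence(expected: list[str], actual: list[str]) -> bool:
    """True if the expected tools appear in `actual` in the same order."""
    remaining = iter(actual)
    # `tool in remaining` advances the iterator, which enforces ordering.
    return all(tool in remaining for tool in expected)


def score_trajectory(steps: list[Step], expected_tools: list[str], token_budget: int) -> dict[str, bool]:
    called = [step.tool for step in steps]
    return {
        "right_tools_in_order": is_subsequence(expected_tools, called),
        "no_extra_calls": len(called) == len(expected_tools),
        "within_budget": sum(step.tokens for step in steps) <= token_budget,
    }


# Hypothetical trace: the agent searched twice before doing the right thing.
trace = [
    Step("search", 400),
    Step("search", 380),
    Step("fetch_order", 250),
    Step("refund", 300),
]

result = score_trajectory(trace, ["search", "fetch_order", "refund"], token_budget=1200)
print(result)
```

Here the agent reached the right tools in the right order, but the repeated search counts as an extra call and pushes the run over the token budget, which is the kind of "right answer, wasteful path" case a final-answer check would miss.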
Data contamination and overfitting
Two quieter risks. Contamination: if your test cases leaked into a model's training data, high scores are meaningless because the model may have memorized the answers. Overfitting to your eval: tweak prompts against the same 50 cases long enough and you'll ace those 50 while quietly getting worse on everything else. Defend by keeping a held-out set that you never tune against and score only occasionally, and by continuously refreshing cases from live traffic so the eval can't go stale.
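One simple way to keep a held-out set stable is to assign each case to a split by hashing its ID, so the assignment never changes as the dataset grows. This is a sketch under assumptions: the `assign_split` helper, the 20% holdout fraction, and the `ticket-N` IDs are all hypothetical.

```python
import hashlib


def assign_split(case_id: str, holdout_fraction: float = 0.2) -> str:
    """Deterministically place a case in 'holdout' or 'dev' based on its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "dev"


case_ids = [f"ticket-{n}" for n in range(1000)]
splits = [assign_split(case_id) for case_id in case_ids]

holdout_share = splits.count("holdout") / len(splits)
print(f"holdout share: {holdout_share:.1%}")
```

Because the split depends only on the ID, new cases pulled from live traffic land in one set or the other without reshuffling existing ones, and a case never migrates from the held-out set into the set you iterate on.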
FAQ
What is an LLM eval in simple terms?
It's a repeatable test for an AI's outputs: you feed the model a fixed set of inputs, score its answers against a definition of "good," and get a comparable number. Re-run it after any change and you know instantly whether you helped or hurt.
Why can't I just read the outputs myself?
Eyeballing works for ten cases and collapses past a few dozen — no human re-reads 300 outputs on every prompt tweak. Worse, fixing one thing often silently breaks others you weren't looking at, and manual checking can't catch that. An eval re-scores every case in minutes.
How many test cases do I need to start?
Start with 10–20 cases drawn from real or realistic inputs, not invented ones. A tiny eval of true user messages is more useful than a huge eval of imagined cases. Grow the set every time production surprises you, turning each bug into a permanent test.
What's the difference between an eval and a benchmark?
A benchmark (like MMLU or GPQA) is a public, standardized test that ranks raw models. An eval usually means testing your own application on your own data and criteria. Same loop, different scope; see "What Are LLM Benchmarks?".
Do I need a framework like DeepEval or LangSmith?
No. A 50-line script with a dataset, a scorer, and a percentage is a real eval. Frameworks add pre-built metrics, dashboards, and production monitoring, which help once you scale — but the discipline of running the eval matters far more than the harness you run it with.
How do you evaluate open-ended answers with no single correct response?
Use model-graded scoring: a separate LLM grades each output against a written rubric (helpful? faithful? right tone?). It's the standard approach for subjective tasks, but validate the judge against human labels first, since LLM judges have biases like favoring longer answers.