What Are LLM Evals? Why "It Looks Good" Isn't Enough

Understand what an LLM eval is, why eyeballing outputs stops working past ten test cases, and what a real eval pipeline looks like.

BEGINNER · 11 MIN READ · UPDATED 2026-06-11

In plain English

An LLM eval (short for evaluation) is a repeatable test for an AI's output. You give the model a set of inputs, collect what it says back, and score those answers against some definition of "good." Run it again next week, after changing your prompt or swapping the model, and you get a number you can compare. That comparable number is the whole point.

Think of it like a spell-checker for behavior instead of spelling. A spell-checker doesn't read your essay and feel that it's fine — it runs every word through a fixed rule and flags failures the same way every time. An eval does that for your AI: same inputs, same scoring, every run, so "better" and "worse" stop being opinions.

Here's the trap evals exist to kill. You build a chatbot, try five questions, the answers look great, you ship. A user asks a sixth question you never tried and the model confidently invents a refund policy that doesn't exist. You had no way to know, because "I tried a few and it looked good" is not a test — it's a vibe. Once you have more than a handful of cases, a human eyeballing every output simply can't keep up, and the cases you don't check are exactly where things break.

Why it matters

Traditional software is deterministic: 2 + 2 returns 4 today, tomorrow, and forever, so a unit test with one assertion is enough. LLMs are not like that. The same prompt can return different wording each time, a vendor can quietly update the model under your API, and a one-word change to your prompt can silently wreck answers for an entire category of questions. You can't assert output == "4" when there are a thousand valid ways to phrase a correct answer.

Evals solve the problem manual checking can't: scale and regression. With ten test cases a human can read them all. With three hundred, nobody will. And the failure mode isn't random — it's that you fix one bug and quietly break three others you weren't looking at. An eval suite re-checks all three hundred cases in minutes, so "I improved the summary prompt" comes with proof you didn't tank the question-answering.

Who should care

Anyone shipping an LLM feature — support bots, summarizers, RAG search, agents. Without evals you're flying blind on every prompt change.
Teams comparing models — is the cheaper model good enough for your task? An eval gives a number instead of a hunch.
Anyone doing prompt engineering — evals are how you tell a prompt tweak that helped from one that looked clever and hurt.
Production owners — when a vendor updates the model behind your API, an eval is your early-warning alarm that behavior drifted.

What did evals replace? Honestly, nothing — they replaced hope. The old workflow was "change the prompt, click around, ship if it feels right." That works for a demo and falls apart the moment real users send inputs you never imagined. Evals turn that gut-feel loop into prompt iteration you can measure.

How it works

Every eval, no matter how fancy, is the same four-part loop: a dataset of inputs (often with expected answers), the system under test that produces outputs, a scorer that grades each output, and an aggregate that rolls the scores into one number you track over time.

// The eval loop

Dataset (inputs + expected) → Run model (collect outputs) → Score each (pass / fail / 0–1) → Aggregate (one tracked number)

The interesting design choice is the scorer — how you decide an answer is good. There are three families, and good eval suites mix them:

Scorer type	How it grades	Best for
Exact / rule-based	String match, regex, valid JSON, contains keyword	Classification, extraction, structured output
Reference-based	Compare to a known good answer (similarity, F1)	Translation, factual Q&A with a fixed answer
Model-graded	Another LLM scores the answer against a rubric	Open-ended writing, tone, helpfulness

Rule-based scorers are cheap, instant, and deterministic — always reach for them first. But "was this summary helpful and faithful?" has no regex. That's where model-graded scoring comes in: you ask a separate LLM to judge the output against a rubric. It's the workhorse for subjective tasks, and it has its own gotchas — read What Is LLM-as-a-Judge? before you lean on it.
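To make the first two rows of the table concrete, here is a minimal sketch of two rule-based scorers and one reference-based scorer. The function names are illustrative, not from any framework, and the F1 here is a simple whitespace-token version; real suites often normalize punctuation as well.

```python
import re
from collections import Counter


def exact_match(output: str, expected: str) -> bool:
    """Rule-based: normalize whitespace and case, then compare."""
    return output.strip().lower() == expected.strip().lower()


def matches_pattern(output: str, pattern: str) -> bool:
    """Rule-based: the whole output must match a regex (e.g. an order ID)."""
    return re.fullmatch(pattern, output.strip()) is not None


def token_f1(output: str, reference: str) -> float:
    """Reference-based: harmonic mean of token precision and recall."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        # Two empty strings agree perfectly; one empty string scores zero.
        return float(out_tokens == ref_tokens)
    # Counter intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Each function takes an output and returns a pass/fail or a 0–1 score, so any of them can fill the scorer slot in the loop.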

Two more terms you'll see constantly. Offline evals run against a fixed dataset before you ship — your regression test suite. Online evals score real production traffic after you ship, where you usually have no "correct" answer to compare against, so you lean on model-graded checks and user signals. They feed each other: production failures become new offline test cases.

// Offline vs online evals

Offline

Fixed curated dataset
Has expected answers
Runs in CI, pre-ship
Catches regressions

Online

Real user traffic
No reference answer
Runs in production
Catches drift + edge cases
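The hand-off from online to offline can be as small as appending a line to a file. The sketch below assumes the offline dataset is a JSONL file with `text`, `expected`, and `source` fields; the file layout and the function name are assumptions for illustration, not a standard.

```python
import json
from pathlib import Path


def add_regression_case(dataset_path: Path, text: str, expected: str, source: str) -> int:
    """Append one case to a JSONL dataset unless the input is already there.

    Returns the number of cases in the file afterwards.
    """
    cases = []
    if dataset_path.exists():
        with dataset_path.open(encoding="utf-8") as f:
            cases = [json.loads(line) for line in f if line.strip()]
    # Skip duplicates so the same production bug is not counted twice.
    if all(case["text"] != text for case in cases):
        new_case = {"text": text, "expected": expected, "source": source}
        with dataset_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(new_case) + "\n")
        cases.append(new_case)
    return len(cases)
```

A human still has to supply the `expected` value, since production traffic arrives without one.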

Build your first eval

You don't need a framework to start. An eval is just a loop, a scorer, and a percentage. Here's a complete one for a classifier that should label support tickets as billing, bug, or other. The scorer is dead simple — exact match — because the output space is closed.

eval_classifier.py (Python)

import os
from anthropic import Anthropic

# Read the key from the environment instead of hard-coding it in the script.
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# 1. The dataset: inputs + the label we expect.
DATASET = [
    {"text": "You charged my card twice this month", "expected": "billing"},
    {"text": "The app crashes when I tap export", "expected": "bug"},
    {"text": "What are your office hours?", "expected": "other"},
    # ...add 50+ real, messy examples here
]

def classify(text: str) -> str:
    """The system under test."""
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"Label this ticket as exactly one of "
                       f"billing, bug, or other. Reply with only the word.\n\n{text}",
        }],
    )
    return msg.content[0].text.strip().lower()

# 2. Run + 3. score each case.
results = []
for case in DATASET:
    got = classify(case["text"])
    passed = got == case["expected"]          # exact-match scorer
    results.append(passed)
    if not passed:
        print(f"FAIL  expected={case['expected']!r} got={got!r}  {case['text']!r}")

# 4. Aggregate into one tracked number.
accuracy = sum(results) / len(results)
print(f"\nAccuracy: {accuracy:.1%}  ({sum(results)}/{len(results)})")

Run it on a full dataset and you get a line like Accuracy: 88.0% (44/50), and now "better" has a definition. Change the prompt, run again, and the number tells you whether the change helped. Print every failure: reading the FAILs is where you learn what your prompt is actually getting wrong, which is the real payoff of an eval.

When exact match won't cut it — "is this summary faithful to the source?" — you swap the passed = got == expected line for a model-graded scorer: a second LLM call that returns a 1 or 0 against a rubric. The loop, the dataset, and the aggregate stay identical. That swap-in is the entire trick. This pattern is also how you measure retrieval quality in RAG evaluation.
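Here is one way that swap could look. This is a sketch under assumptions: the rubric wording, the PASS/FAIL reply format, and the `model_graded_score` name are all made up for illustration, and the judge is passed in as a plain function so the example runs without an API key.

```python
from typing import Callable

# Hypothetical rubric prompt; adapt the wording to your own criteria.
JUDGE_PROMPT = """You are grading a summary against its source.
Rubric: the summary must not state anything the source does not support.

Source:
{source}

Summary:
{summary}

Reply with exactly one word: PASS or FAIL."""


def model_graded_score(source: str, summary: str, judge: Callable[[str], str]) -> int:
    """Return 1 if the judge says PASS, 0 otherwise.

    `judge` is any function that sends a prompt to an LLM and returns its
    text reply. Anything other than a clear PASS counts as a failure.
    """
    verdict = judge(JUDGE_PROMPT.format(source=source, summary=summary))
    return 1 if verdict.strip().upper().startswith("PASS") else 0


# Stand-in judge for a dry run; a real one would call a model.
score = model_graded_score(
    source="Refunds are available within 30 days of purchase.",
    summary="Customers can get a refund within 30 days.",
    judge=lambda prompt: "PASS",
)
```

In the script above, `judge` would wrap a second `client.messages.create` call, and the scoring line would become something like `passed = model_graded_score(...) == 1`. Treating an unparseable verdict as a failure is a deliberate choice: a judge that rambles should not silently inflate the score.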

What to actually measure

The hardest part of evals isn't the code — it's deciding what good means for your task. "Accuracy" is meaningless until you define the criteria. A few that come up over and over:

Correctness — is the answer factually right? The baseline for Q&A and extraction.
Faithfulness / groundedness — does the answer stick to the provided source, or did the model hallucinate? Critical for RAG and summarization.
Format / schema — is it valid JSON, the right length, the required fields present? A cheap rule-based check that catches a huge class of bugs.
Relevance — did it answer the question that was asked, or wander off?
Safety / refusal — does it refuse what it should and not over-refuse harmless requests?
Tone & style — does it match your brand voice? Almost always model-graded.

Pick the two or three criteria that matter most for your use case and ignore the rest at first. A support bot lives or dies on correctness and faithfulness; a creative-writing tool cares about tone. Trying to measure everything on day one is the fastest way to never ship an eval at all.
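Of the criteria above, format / schema is usually the quickest to automate. A minimal sketch, assuming the model is supposed to return a JSON object with `label` and `reason` strings under 200 characters (the schema and the limit are hypothetical):

```python
import json

REQUIRED_FIELDS = {"label": str, "reason": str}  # hypothetical schema
MAX_CHARS = 200                                  # hypothetical length limit


def check_format(raw_output: str) -> list[str]:
    """Return a list of format problems; an empty list means the check passed."""
    if len(raw_output) > MAX_CHARS:
        return [f"output longer than {MAX_CHARS} characters"]
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems
```

Returning the list of problems rather than a bare boolean keeps the failure printout useful when you read through the FAILs.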

The tool landscape

You can write evals from scratch — the example above is the whole idea — but frameworks save you boilerplate, give you nice reports, and ship pre-built scorers for common criteria. The big names you'll run into:

Tool	What it is	Good when
OpenAI Evals	Open-source eval framework + benchmark registry	You want a battle-tested harness and Python
DeepEval	`pytest`-style LLM testing with built-in metrics	You already think in unit tests
LangSmith	Hosted datasets, tracing, offline + online evals	You want a UI and production monitoring
Ragas	Metrics specialized for RAG pipelines	Your app retrieves before it answers
promptfoo	Config-driven eval + prompt comparison CLI	You want fast side-by-side prompt tests

Don't agonize over the choice. The framework matters far less than having a dataset and running it regularly. Most teams start with a 50-line script like the one above, then graduate to a tool once they need shared datasets, dashboards, or observability on production traffic. The discipline of running the eval beats the sophistication of the harness every time.

// Evals in your dev loop

Change prompt or model → Run eval (all cases) → Read failures (what broke?) → Add cases (from prod bugs) → ↺ repeat
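The "read failures" step gets easier if you keep per-case results, not just the aggregate. The sketch below compares two runs keyed by a case ID; the IDs and the `diff_runs` helper are illustrative assumptions.

```python
def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-case pass/fail results from two runs of the same dataset."""
    # Only compare cases present in both runs.
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [cid for cid in shared if not before[cid] and after[cid]],
        "regressed": [cid for cid in shared if before[cid] and not after[cid]],
    }


# Both runs score 3/4, yet one case was fixed and another broke.
before = {"t1": True, "t2": False, "t3": True, "t4": True}
after = {"t1": True, "t2": True, "t3": False, "t4": True}
report = diff_runs(before, after)
print(report)
```

The aggregate is identical across the two runs here, which is exactly the situation where a single number hides a regression.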

Going deeper

Once your eval suite is part of daily life, a harder set of problems shows up. These are the things that separate a toy eval from one teams actually trust.

Grading the grader

If a model judges your outputs, who judges the judge? Model-graded scorers have real biases: they favor longer answers, they prefer outputs that match their own style, and they can be inconsistent run to run. The fix is to validate the judge against human labels — have people grade a sample, then check how often the LLM judge agrees. If agreement is low, your eval is measuring the judge's quirks, not your app's quality. This is the central pitfall covered in What Is LLM-as-a-Judge?.
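Checking the judge against people comes down to comparing two lists of labels. The sketch below computes raw agreement and Cohen's kappa, a standard statistic that discounts agreement expected by chance; the choice of kappa is this example's, not something the approach requires.

```python
from collections import Counter


def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Fraction of cases where the judge gave the same label as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own base rates.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in human_counts.keys() | judge_counts.keys()
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
print(agreement_rate(human_labels, judge_labels))  # 0.8
print(round(cohens_kappa(human_labels, judge_labels), 2))  # 0.58
```

Raw agreement can look high simply because most outputs pass, which is why a chance-corrected number is worth reporting next to it.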

Non-determinism and statistical noise

Run the same eval twice and the score can wobble by a point or two, because generation is sampled, not fixed (the temperature setting controls how much). So a prompt change that moves accuracy from 87% to 88% may be pure noise. Mitigations: set temperature to 0 for graded runs where it makes sense (this reduces run-to-run variation but does not always remove it), use enough cases that small differences are meaningful, and treat tiny score changes with suspicion rather than celebration.
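A rough way to judge whether a difference is meaningful is to put a confidence interval around each pass rate. The sketch below uses the Wilson score interval with a 300-case dataset assumed for illustration. It treats each case as an independent pass/fail draw, which is a simplification, so read it as a sanity check rather than a formal test.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a pass rate (Wilson score)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


low_a, high_a = wilson_interval(261, 300)  # 87.0% accuracy
low_b, high_b = wilson_interval(264, 300)  # 88.0% accuracy
print(f"87%: {low_a:.3f} to {high_a:.3f}")
print(f"88%: {low_b:.3f} to {high_b:.3f}")
```

With 300 cases each interval is several points wide and the two overlap almost entirely, so a one-point move on its own is weak evidence of a real improvement.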

Evaluating agents and multi-step traces

Scoring a single answer is easy. Scoring an agent that planned, called five tools, and looped is not — the final answer can be right while the path was wasteful or wrong, or the answer can be wrong because of one bad tool call early. Serious agent evals score trajectories: did it pick the right tool, in the right order, without burning a fortune in tokens? This is an active frontier with far fewer settled best practices than single-turn evals.
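As a small illustration of trajectory scoring, the sketch below grades a recorded list of tool calls on the three questions above. The `Step` record, the tool names, and the specific checks are assumptions made for this example; real agent traces carry far more detail, and there is no settled standard for these metrics.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str    # name of the tool the agent called
    tokens: int  # tokens spent on this step


def score_trajectory(
    steps: list[Step], expected_tools: list[str], token_budget: int
) -> dict[str, bool]:
    """Grade the path an agent took, separately from its final answer."""
    called = [step.tool for step in steps]
    # Subsequence check: each expected tool must appear after the previous
    # one, with extra calls allowed in between.
    remaining = iter(called)
    in_order = all(tool in remaining for tool in expected_tools)
    return {
        "right_tools_in_order": in_order,
        "no_extra_calls": len(called) == len(expected_tools),
        "within_budget": sum(step.tokens for step in steps) <= token_budget,
    }


trace = [Step("search", 400), Step("search", 350), Step("fetch_order", 200), Step("refund", 150)]
result = score_trajectory(trace, ["search", "fetch_order", "refund"], token_budget=1000)
print(result)
```

This trace reaches the right tools in the right order but repeats a search and overshoots the budget, the kind of wasteful-but-correct path a final-answer check would miss.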

Data contamination and overfitting

Two quieter risks. Contamination: if your test cases leaked into a model's training data, high scores are meaningless — the model memorized the answers. Overfitting to your eval: tweak prompts against the same 50 cases long enough and you'll ace those 50 while quietly getting worse on everything else. Defend by keeping a held-out set you tune against rarely, and by continuously refreshing cases from live traffic so the eval can't go stale.
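One simple way to carve out a held-out set is to assign each case by hashing a stable ID, so a case never drifts between splits as the dataset grows. The 20% share and the `assign_split` name are assumptions for illustration.

```python
import hashlib


def assign_split(case_id: str, held_out_percent: int = 20) -> str:
    """Deterministically place a case in 'dev' or 'held_out' by hashing its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0-99
    return "held_out" if bucket < held_out_percent else "dev"


splits = [assign_split(f"ticket-{i}") for i in range(1000)]
print(splits.count("held_out"), "of 1000 cases held out")
```

Iterate on prompts against the `dev` cases and run the `held_out` cases only occasionally, so a good score there still tells you something.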

FAQ

What is an LLM eval in simple terms?

It's a repeatable test for an AI's outputs: you feed the model a fixed set of inputs, score its answers against a definition of "good," and get a comparable number. Re-run it after any change and you know instantly whether you helped or hurt.

Why can't I just read the outputs myself?

Eyeballing works for ten cases and collapses past a few dozen — no human re-reads 300 outputs on every prompt tweak. Worse, fixing one thing often silently breaks others you weren't looking at, and manual checking can't catch that. An eval re-scores every case in minutes.

How many test cases do I need to start?

Start with 10–20 cases drawn from real or realistic inputs, not invented ones. A tiny eval of true user messages is more useful than a huge eval of imagined cases. Grow the set every time production surprises you, turning each bug into a permanent test.

What's the difference between an eval and a benchmark?

A benchmark (like MMLU or GPQA) is a public, standardized test that ranks raw models. An eval usually means testing your own application on your own data and criteria. Same loop, different scope — see What Are LLM Benchmarks?.

Do I need a framework like DeepEval or LangSmith?

No. A 50-line script with a dataset, a scorer, and a percentage is a real eval. Frameworks add pre-built metrics, dashboards, and production monitoring, which help once you scale — but the discipline of running the eval matters far more than the harness you run it with.

How do you evaluate open-ended answers with no single correct response?

Use model-graded scoring: a separate LLM grades each output against a written rubric (helpful? faithful? right tone?). It's the standard approach for subjective tasks, but validate the judge against human labels first, since LLM judges have biases like favoring longer answers.
