Code-Graded vs Model-Graded vs Human-Graded Evals

Understand the three grading methods for LLM outputs and be able to pick the right mix for any eval.

Intermediate · 12 min read · Updated 2026-06-12

In plain English

When you run an LLM eval you need someone — or something — to look at each output and say "pass" or "fail." That grader is at the heart of your whole evaluation strategy, and you have exactly three options: a piece of code that checks the output mechanically, another AI model that reads the output and judges it, or a human who reads and scores it the old-fashioned way.


Think of it like grading a high-school exam. Some questions are multiple-choice: a Scantron machine grades them in seconds, perfectly, for pennies — that's code grading. Some questions are short-answer and a teacher's assistant can mark them quickly with a rubric — that's model grading (the model plays the TA). The open-ended essay that requires real judgment, cultural nuance, or a final decision about a policy — that goes to the human grader. The right exam uses all three, because each question type fits one grader better than the others.

Why it matters

The grading method you choose determines three things: how fast your eval runs, how much it costs, and how accurate the scores are. Get this wrong and you end up with evals that are either too slow to run on every pull request, too expensive to run on realistic dataset sizes, or so noisy that a real regression looks like normal variance and you ship a broken model.

A common mistake is starting with human review because "it's the most accurate" and then giving up on evals entirely when the cost hits. The opposite mistake is coding up a regex check for a task that requires nuanced judgment, then trusting a green scorecard that means nothing. Matching the grader to the task is a core engineering decision, not an afterthought.

The core tension

Every grading method sits somewhere on two axes: cost vs. scale and accuracy vs. speed. Code is fast and free but can only check things that have a single objectively correct form. Humans are slow and expensive but catch the widest range of problems, including ones nobody thought to write a check for. Model judges fill the middle: they can reason about quality like a human but run automatically like code. The art is knowing which method belongs to which part of your test suite.

How it works

Each grading method follows a different mechanical path from raw model output to a score.

// Three grading paths

Code-Graded

Model output arrives
Parse / extract fields
Run deterministic check
Return pass / fail / score

Model-Graded

Model output arrives
Build judge prompt + rubric
Call judge LLM
Parse structured score

Human-Graded

Model output arrives
Queue for review
Annotator reads + scores
Aggregate labels

Code-graded evals

A code-graded eval is just a function. It takes the model output as a string (or parsed object) and returns a number or boolean using deterministic logic: exact string match, regex, JSON schema validation, executing the model's code and checking stdout, comparing a numeric result to a tolerance, or asserting that a required field is present. No network calls, no randomness, no cost per evaluation beyond compute.

Example: code-graded checks (Python)

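import json
import re

def grade_json_output(output: str) -> float:
    """Return the fraction of required keys present in the model's JSON object."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0  # valid JSON, but not an object (e.g. a list or a bare string)
    required = {"summary", "sentiment", "topics"}
    present = required.intersection(data.keys())
    return len(present) / len(required)

def grade_sql_output(output: str) -> bool:
    """Pass only if the SQL contains no data-modifying keywords (a crude read-only check)."""
    # \b avoids matching inside longer identifiers such as "backdrop" or "last_update".
    danger = re.compile(r"\b(DROP|DELETE|INSERT|UPDATE)\b", re.I)
    return not danger.search(output)

def grade_exact_match(output: str, expected: str) -> bool:
    """Case-insensitive exact match after trimming surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()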

Model-graded evals (LLM-as-a-judge)

A model-graded eval sends the model's output to a separate judge LLM — usually a larger or more capable model — along with a rubric prompt that describes what makes an answer good. The judge returns a structured score (often a number from 1 to 5, or a pass/fail with a brief reasoning chain). Because the judge is itself a language model, it can read natural language, apply context, spot hallucinations, and evaluate qualities like helpfulness or tone that no regex can capture.

Example: model-graded judge call (Anthropic SDK, Python)

import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = """
You are a strict evaluator. Score the following answer on a scale of 1-5.

Question: {question}
Answer: {answer}

Criteria:
- 5: Correct, concise, no hallucinations, cites evidence where appropriate.
- 3: Mostly correct but vague or missing important detail.
- 1: Incorrect, misleading, or harmful.

Return ONLY a JSON object: {{"score": <int>, "reason": "<one sentence>"}}
"""

def model_grade(question: str, answer: str) -> dict:
    """Ask the judge model for a score and parse its JSON reply."""
    prompt = RUBRIC.format(question=question, answer=answer)
    msg = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=128,
        messages=[{"role": "user", "content": prompt}]
    )
    # Raises json.JSONDecodeError if the judge returns anything other than bare JSON.
    return json.loads(msg.content[0].text)

Human-graded evals

Human-graded evals route model outputs to a queue where trained annotators apply a rubric and record a label. The tooling varies — Braintrust, Label Studio, Scale AI, or even a shared spreadsheet — but the mechanics are the same: a person reads the output, consults the rubric, and clicks a rating. Because humans understand context, culture, implication, and risk in ways no current model reliably replicates, human grading is the ground truth your entire eval system calibrates against.
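The "aggregate labels" step can be as simple as a majority vote per output. Below is a minimal sketch, assuming each output is labeled by several annotators; `majority_label` and its tie-handling policy are illustrative choices, not the behavior of any particular annotation tool.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_label(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the most common label, or None when there are no labels or a tie for first."""
    if not labels:
        return None
    counts = Counter(labels).most_common()
    # A tie for first place means the annotators disagree; flag it instead of guessing.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


print(majority_label(["pass", "pass", "fail"]))  # pass
print(majority_label(["pass", "fail"]))          # None
```

Returning `None` on a tie keeps disagreements visible, so they can go to a third annotator rather than being silently resolved.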

When to use each method

The method follows from the task, not from preference. Below is a practical decision guide.

Task type	Best grader	Why
Structured output (JSON, SQL, regex pattern)	Code	Validity is mechanically checkable; instant and free
Exact answer (capitals, dates, entity names)	Code	Correct answer known ahead of time
Code correctness (does it run, does it produce right output)	Code	Execute and check stdout / return value
Summary quality, tone, helpfulness	Model judge	Subjective; no single correct string
Factual accuracy in long-form text	Model judge	Requires reading and reasoning
RAG faithfulness (answer grounded in context?)	Model judge	Needs to compare output to source document
High-stakes decisions (medical, legal, financial)	Human	Risk of model-judge errors too costly
Novel task with no established rubric yet	Human	Need to discover what 'good' looks like first
Calibrating a new model judge	Human	Validate judge before trusting it at scale

A useful rule of thumb: if you could write a Python assert that perfectly captures "correct," use code. If you could explain what makes an answer good in a paragraph and a TA could reliably apply it, use a model judge. If the stakes are too high to trust an automated scorer or you genuinely don't know what good looks like yet, use humans.

Cost and scale in practice

Code-graded evals are essentially free beyond compute. You can run thousands of them per second with no per-call cost. Model-graded evals cost API tokens, typically between $0.02 and $0.55 per evaluation depending on the judge model and output length. Human-graded evals cost between $0.10 and $5 per label depending on complexity and whether annotators need domain expertise. For a dataset of 1,000 outputs, human grading can run from $100 to $5,000; at the per-evaluation rates above, a model judge on the same set would cost roughly $20 to $550.
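Dataset totals are just per-item rates multiplied by dataset size. The short sketch below does that arithmetic for the human-label rates quoted above; `estimate_cost` is a hypothetical helper, and you would plug in your own judge pricing the same way.

```python
def estimate_cost(n_outputs: int, low_per_item: float, high_per_item: float) -> tuple[float, float]:
    """Return the (low, high) total cost of grading n_outputs items."""
    return (n_outputs * low_per_item, n_outputs * high_per_item)


low, high = estimate_cost(1_000, 0.10, 5.00)
print(f"human review: ${low:,.0f} to ${high:,.0f}")  # human review: $100 to $5,000
```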

Pitfalls and known biases

Each grading method has failure modes. Knowing them lets you design around them rather than discover them after a bad release.

Code-graded pitfalls

False negatives from surface variation: "New York" vs. "new york" vs. "NYC" are all correct but a naive exact-match fails two of them. Normalize case and whitespace before comparing, and keep an alias list for known variants such as abbreviations.
Checking the wrong thing: a regex that confirms SQL starts with SELECT says nothing about whether the query answers the user's actual question.
Brittle JSON parsing: if the model wraps JSON in a markdown code fence, json.loads() raises json.JSONDecodeError. Strip fences before parsing.
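Two of these pitfalls have cheap mechanical fixes. Here is a minimal sketch, assuming a small hand-maintained alias table and at most one code fence wrapping the whole output; `ALIASES`, `normalize`, `strip_code_fence`, and `parse_json_output` are hypothetical names for illustration.

```python
import json
import re

# Hypothetical alias table; fill it with the variants your task actually produces.
ALIASES = {"nyc": "new york", "new york city": "new york"}

FENCE_MARK = "`" * 3  # a markdown code fence marker (three backticks)
FENCE = re.compile(rf"^{FENCE_MARK}[\w-]*[ \t]*\n(.*?)\n?{FENCE_MARK}$", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, then map known aliases to a canonical form."""
    cleaned = " ".join(text.lower().split())
    return ALIASES.get(cleaned, cleaned)


def strip_code_fence(text: str) -> str:
    """Remove a single surrounding markdown code fence, if there is one."""
    stripped = text.strip()
    match = FENCE.match(stripped)
    return match.group(1) if match else stripped


def parse_json_output(output: str):
    """Parse model output as JSON, tolerating a surrounding code fence."""
    return json.loads(strip_code_fence(output))


print(normalize("NYC") == normalize("  New   York "))  # True
fenced = FENCE_MARK + 'json\n{"sentiment": "positive"}\n' + FENCE_MARK
print(parse_json_output(fenced))  # {'sentiment': 'positive'}
```

An alias table only covers variants you already know about; anything open-ended still belongs with a model judge.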

Model-graded pitfalls

Position bias — in pairwise comparisons, some judge models favor whichever response appears first in the prompt. Mitigate by running both orderings and taking the consistent result.
Verbosity bias — judge models often rate longer answers higher regardless of quality. Counter this by explicitly penalizing unnecessary length in your rubric.
Self-preference bias — a model tends to rate outputs that match its own style higher. Use a different model family as your judge than the one generating outputs.
Rubric sensitivity — small wording changes in the judge prompt can shift scores by a full point on a 5-point scale. Pin your rubric prompt to a version and treat it like production code.
Circular grading — using the exact same model to both generate and grade creates a closed loop with no external signal. At minimum, use a larger or differently-trained judge.
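The position-bias mitigation (run both orderings and keep only the consistent result) can be sketched as follows. The judge is passed in as a plain callable so the logic can be tested without an API call; its signature, returning "first" or "second", is an assumption made for this example.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(judge: PairwiseJudge, question: str, a: str, b: str) -> str:
    """Run the judge with both orderings; return "a", "b", or "inconsistent"."""
    forward = judge(question, a, b)    # a is shown first
    backward = judge(question, b, a)   # b is shown first
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    # The verdict flipped with the ordering, so treat it as a tie or escalate.
    return "inconsistent"


# A stub judge that always picks whatever it sees first is exposed immediately.
always_first = lambda question, x, y: "first"
print(debiased_preference(always_first, "q", "answer A", "answer B"))  # inconsistent
```

This doubles the number of judge calls per comparison, which is the price of ruling out ordering effects.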

Human-graded pitfalls

Inter-annotator disagreement: two humans reading the same output may score it differently. Measure agreement with Cohen's kappa for a pair of annotators (or a multi-rater statistic such as Fleiss' kappa for larger pools); below 0.6 means your rubric needs clarification.
Annotation fatigue: quality drops as sessions run long. Keep individual annotation sessions short and rotate annotators.
Selection bias in sampling: humans often review outputs that look interesting or broken rather than a random sample, which skews your aggregate metrics.
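Cohen's kappa compares the agreement two annotators actually reached with the agreement they would reach by chance given how often each uses each label. A small pure-Python sketch, assuming exactly two annotators who labeled the same items in the same order (libraries such as scikit-learn provide an equivalent function):

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: fraction of items where the labels match.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


a = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass"]
print(round(cohens_kappa(a, b), 2))  # 0.75
```

Here the annotators agree on 7 of 8 items (0.875), chance agreement is 0.5, and kappa is (0.875 - 0.5) / (1 - 0.5) = 0.75.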

Combining graders: the layered approach

Production eval suites rarely use a single grading method. The standard pattern is to layer them: use cheap, fast code checks as the first gate, then model judges for the subset of quality dimensions that need reasoning, and periodic human review to calibrate and audit the model judges. Each layer catches failures the others miss.

// Layered grading pipeline

Model output: raw LLM response
Code checks: schema, format, exact match (instant, free)
Model judge: quality, tone, faithfulness (~$0.05-0.50/call)
Human sample review: 5-10% random sample (calibration + audit)
Aggregate score: weighted combination of all layers

A concrete starting point: write code-graded checks for every structural requirement (valid JSON, required fields present, no forbidden patterns). Add a model judge for quality dimensions you care about most (faithfulness, helpfulness, tone). Route 5-10% of outputs to human review each week and compare those human scores to the model judge scores. If the judge drifts from the humans, update the rubric prompt and re-validate before trusting the judge again.
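One way to wire those layers together is sketched below. It is a rough outline under stated assumptions, not a reference implementation: the judge is a hypothetical callable returning a 1-5 score, the required keys are borrowed from the JSON example earlier, and human review is modeled as a random flag on each output.

```python
import json
import random
from typing import Callable

REQUIRED_KEYS = {"summary", "sentiment", "topics"}


def code_check(output: str) -> bool:
    """Layer 1: cheap structural gate (a JSON object with the required keys)."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


def grade_layered(
    output: str,
    judge: Callable[[str], int],   # hypothetical: returns a 1-5 quality score
    rng: random.Random,
    human_sample_rate: float = 0.05,
) -> dict:
    """Run the layers in order and report what each one decided."""
    # Layer 3 flag: a random sample for human review, independent of any score.
    needs_human = rng.random() < human_sample_rate
    if not code_check(output):
        # Fail fast: no judge tokens are spent on structurally broken output.
        return {"passed_code": False, "judge_score": None, "needs_human": needs_human}
    # Layer 2: only structurally valid outputs reach the model judge.
    return {"passed_code": True, "judge_score": judge(output), "needs_human": needs_human}


rng = random.Random(0)  # seeded so the human-review sample is reproducible
print(grade_layered("not json", judge=lambda o: 5, rng=rng)["passed_code"])  # False
```

Sampling for human review before any scoring keeps the sample random, which avoids the selection bias described in the pitfalls section.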

Some teams use a two-stage model-grading setup: a fast, cheap judge (like gpt-5.4-mini or claude-haiku-4-5) runs on every eval in CI, and a slower, more capable judge (like claude-opus-4-8 or gpt-5.5) runs only on the samples that fail or are near the threshold. This keeps costs down while preserving accuracy where it matters most.
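The escalation rule in that two-stage setup can be sketched like this. Both judges are hypothetical callables returning a 1-5 score, and the `threshold` and `margin` defaults are placeholder values you would tune for your own rubric.

```python
from typing import Callable

Judge = Callable[[str], float]  # hypothetical: maps an output to a 1-5 score


def two_stage_score(
    output: str,
    cheap_judge: Judge,
    strong_judge: Judge,
    threshold: float = 3.0,
    margin: float = 0.5,
) -> tuple[float, str]:
    """Score with the cheap judge; escalate failures and near-threshold cases."""
    score = cheap_judge(output)
    if score >= threshold + margin:
        return score, "cheap"  # clear pass: keep the cheap verdict
    # Failing or borderline: pay for the more capable judge.
    return strong_judge(output), "strong"


print(two_stage_score("some output", lambda o: 4.5, lambda o: 4.0))  # (4.5, 'cheap')
print(two_stage_score("some output", lambda o: 3.2, lambda o: 2.0))  # (2.0, 'strong')
```

The second call escalates because 3.2 passes the threshold but sits inside the margin, so the stronger judge's score replaces it.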

Going deeper

Once you have a working layered eval, the next frontier is calibration at scale. The goal is to reduce your dependence on expensive human grading without sacrificing accuracy. The standard approach is to treat calibration as an ongoing process: every week, pull a random sample of outputs, have humans score them, and measure the gap between those scores and your model judge. When the gap exceeds a threshold (say, more than 0.4 points on a 5-point scale), trigger a rubric review.
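The weekly check reduces to a few lines once you have paired scores. The sketch below assumes the "gap" is the mean absolute difference between human and judge scores on the same outputs, which is one reasonable definition among several; the function names are illustrative.

```python
from statistics import mean
from typing import Sequence


def calibration_gap(human: Sequence[float], judge: Sequence[float]) -> float:
    """Mean absolute difference between human and judge scores on the same outputs."""
    if len(human) != len(judge) or not human:
        raise ValueError("Need paired, non-empty score lists.")
    return mean(abs(h - j) for h, j in zip(human, judge))


def needs_rubric_review(human: Sequence[float], judge: Sequence[float], threshold: float = 0.4) -> bool:
    """True when the judge has drifted further from the humans than the threshold allows."""
    return calibration_gap(human, judge) > threshold


human_scores = [5, 4, 3, 2]
judge_scores = [4, 3, 3, 3]
print(calibration_gap(human_scores, judge_scores))      # 0.75
print(needs_rubric_review(human_scores, judge_scores))  # True
```

Mean absolute difference hides the direction of the drift; tracking the signed mean alongside it shows whether the judge is consistently too lenient or too strict.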

Fine-tuned judge models

General-purpose LLM judges are fast to set up but imprecise for specialized domains. Teams handling domain-specific outputs (medical notes, legal summaries, financial analyses) often fine-tune a smaller open-source judge model on their own human-labeled dataset. A fine-tuned 8B judge trained on 2,000 labeled examples can outperform a general-purpose 70B judge on the specific rubric it was trained for, at a fraction of the API cost.

Eval harness tooling

Several open-source and commercial frameworks bundle all three grading methods behind a unified API. Braintrust supports code scorers, custom LLM judges, and human annotation queues in a single platform. Langfuse integrates LLM-as-a-judge into its tracing UI so you can attach a score to any production trace. Evidently AI provides a library of pre-built LLM metrics (correctness, faithfulness, toxicity) that are model-graded under the hood. OpenAI Evals (for OpenAI users) supports exact-match graders, model-graded graders, and human review workflows through its grader spec.

The G-Eval pattern

G-Eval, introduced in a 2023 paper from Microsoft researchers, is one of the more reliable model-grading patterns. The judge is prompted to first generate step-by-step evaluation steps from the task description and criteria (chain-of-thought), then to score the output. Instead of taking the single integer the judge emits, G-Eval reads the probabilities the judge assigns to each candidate score token and uses the probability-weighted average (the expected value of that distribution) as the final score. This yields finer-grained scores than direct integer scoring, and the paper reports stronger correlation with human judgments on open-ended tasks such as summarization and dialogue.
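The expected-value step is simple to compute once you have the judge's log-probabilities for each candidate score token. How you obtain those depends on your provider's API and is not shown here; the input dictionary below is made-up data for illustration.

```python
import math


def expected_score(logprobs: dict[int, float]) -> float:
    """Probability-weighted score from the judge's log-probabilities per score token."""
    # Renormalize over the candidate scores, since other tokens also carry probability mass.
    weights = {score: math.exp(lp) for score, lp in logprobs.items()}
    total = sum(weights.values())
    return sum(score * w for score, w in weights.items()) / total


# Illustrative distribution: P(3) = 0.1, P(4) = 0.6, P(5) = 0.3.
judge_logprobs = {3: math.log(0.1), 4: math.log(0.6), 5: math.log(0.3)}
print(round(expected_score(judge_logprobs), 2))  # 4.2
```

A judge forced to emit one integer would return 4 here; the weighted score of 4.2 preserves the information that it leaned toward 5 more than toward 3.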

Pairwise vs. pointwise grading

There are two fundamental grading modes for model judges. Pointwise grading scores each output on an absolute scale (1-5). Pairwise grading shows the judge two outputs side-by-side and asks which is better. Pairwise grading correlates more strongly with human preference in research settings (humans also find it easier to say "A is better than B" than to assign an absolute score), but it scales poorly when ranking many candidates: comparing N outputs against each other requires N(N-1)/2 judge calls, which is O(N²), and twice that if you run both orderings to control for position bias. Comparing two model versions head-to-head is cheaper, at one comparison per test prompt. Pointwise is the default for CI; pairwise is valuable for final release comparisons between two model versions.
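To see how quickly all-pairs comparison grows, the sketch below counts the judge calls needed to compare every output against every other one. `all_pairs` is a hypothetical helper built on the standard library's `itertools.combinations`.

```python
from itertools import combinations


def all_pairs(outputs: list[str]) -> list[tuple[str, str]]:
    """Every unordered pair of outputs that an all-pairs ranking would send to the judge."""
    return list(combinations(outputs, 2))


for n in (2, 10, 100):
    pairs = len(all_pairs([f"output-{i}" for i in range(n)]))
    # Columns: number of outputs, judge calls, judge calls with both orderings.
    print(n, pairs, 2 * pairs)
# 2 1 2
# 10 45 90
# 100 4950 9900
```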

FAQ

Can I use the same model to generate and grade outputs?

Technically yes, but it creates a circular signal — the model tends to rate its own outputs more favorably than humans would. This is called self-preference bias. For any serious eval, use a different model family as the judge, or at minimum a larger model from the same family that wasn't used to generate the outputs.

How do I know if my model judge is trustworthy?

Validate it against human labels on a held-out sample from your specific domain. Calculate agreement metrics like Cohen's kappa or Spearman correlation. A kappa above 0.7 indicates substantial agreement and is generally considered sufficient for automated use. Revalidate whenever you change the judge model, rubric prompt, or input distribution.

What is the cheapest way to start evaluating LLM output quality?

Start with code-graded checks for all structural requirements — they are free, run in milliseconds, and can be wired into CI in an afternoon. Add a model judge only for the quality dimensions that matter most and that code cannot check. Run human review as a periodic sample, not on every output.

How many human-labeled examples do I need to validate a model judge?

A sample of 100-200 diverse outputs from your actual distribution is usually enough to get a reliable correlation estimate. For high-stakes domains, aim for 500+. The key is that the sample must be representative — don't only label outputs you think are interesting.

Is exact match ever good enough for LLM evals?

Yes, for tasks with a single objectively correct answer: entity extraction where you know the gold entity, yes/no classification, multiple-choice answers, numeric results, and structured outputs like JSON or SQL where correctness is well-defined. For anything involving natural language phrasing, exact match will produce too many false failures and is the wrong tool.

What is pairwise grading and when should I use it?

Pairwise grading shows a model judge two outputs side-by-side and asks which is better. It correlates more strongly with human preference than absolute pointwise scoring, but ranking N outputs against each other costs O(N²) judge calls. Use it for final comparisons between two model versions before a release decision; use pointwise scoring for everyday CI runs.
