AI/TLDR

Reference-Free vs Reference-Based Evaluation Explained

You'll understand the difference between grading against a known answer and judging quality with no reference, and when each is appropriate.

Intermediate · 10 min read · Updated 2026-06-13

In plain English

When you grade an AI's output, you face one basic question: do you already have the right answer to compare against? That single fork splits evaluation into two families.

Reference-based evaluation is like marking a math test with the answer key in hand. You have a known-correct answer — the reference (also called the gold answer or ground truth) — and you check how close the model's output lands to it. The grade flows directly from that comparison.

Reference-free evaluation is like a hiring panel judging an essay where there is no single correct version. Nobody hands you a model answer. Instead you ask: is this clear, accurate, on-topic, well-written? You judge the output on its own merits against a standard of quality, not against a stored answer.

Both can be run by a human, by code, or by an LLM-as-a-judge. The reference-based / reference-free split is about what you compare to, not who does the grading.

Why it matters

Pick the wrong family and your evaluation quietly lies to you. Two failure modes are common, and they point in opposite directions.

Forcing a reference where none exists. Open-ended tasks — summaries, chat replies, brainstorms, rewrites, code that can be written ten valid ways — have no single right answer. If you score them by string-matching to one gold answer, a better response that happens to use different words gets marked wrong. You end up optimizing your model to imitate one phrasing instead of being good.

Going reference-free where a ground truth exists. For a math problem, a SQL query, or "what is the capital of France," there is a correct answer. Asking a judge model to vibe-check quality with no reference throws away the cheapest, most reliable signal you have. A == comparison is faster, free, and never has an opinion of its own.

Why a builder cares: most real products mix both. A support assistant must retrieve the correct policy (closed, reference-based) and phrase it helpfully (open, reference-free). Knowing which half you are measuring tells you which metric to trust, where bias can creep in, and what to fix when the score moves.

  • Trust. Reference-based scores are reproducible and auditable — anyone can see the gold answer. Reference-free scores carry the judge's opinions and biases, so they need more validation before you believe them.
  • Cost and speed. Comparing to a reference is often a cheap string or code check. Reference-free grading usually means calling a model on every output, which costs tokens and time.
  • Coverage. Reference-based only works for the tasks you wrote answers for. Reference-free scales to open tasks and to inputs you never anticipated.

How it works

The two pipelines share a model output but diverge the moment grading starts. One pulls in a stored gold answer; the other pulls in a quality rubric.

The reference-based path

You need a golden dataset: inputs paired with known-correct outputs. At eval time you run the model on each input and compare its answer to the stored reference. The comparison can be strict or fuzzy:

  • Exact match — output must equal the reference character-for-character. Great for labels, IDs, multiple-choice letters, normalized numbers.
  • Programmatic check — run the code, execute the SQL, parse the JSON, check the number is within tolerance. This is code-graded evaluation, and it is the gold standard when it applies.
  • Similarity metrics — overlap or embedding-distance scores (BLEU, ROUGE, F1, BERTScore) that allow different wording but still anchor to the reference. Useful when many phrasings are acceptable but there is still a target meaning.
  • Reference-guided LLM judge — give the judge the output and the gold answer and ask "is this output consistent with the reference?" The reference keeps the judge honest.
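
The first three comparison styles above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names (`normalize`, `exact_match`, `within_tolerance`, `token_f1`) and the normalization rules are assumptions made for this example, and real metric libraries normalize text differently.

```python
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    Deliberately crude (hypothetical helper): it also splits decimals
    such as "3.14" into two tokens, so parse numbers separately.
    """
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(output: str, reference: str) -> bool:
    """Strict comparison after light normalization."""
    return normalize(output) == normalize(reference)


def within_tolerance(output: str, reference: float, rel_tol: float = 1e-6) -> bool:
    """Programmatic check: parse the output as a number, compare within a tolerance."""
    try:
        value = float(output.strip())
    except ValueError:
        return False
    return math.isclose(value, reference, rel_tol=rel_tol)


def token_f1(output: str, reference: str) -> float:
    """Token-overlap F1: tolerates different wording but stays anchored to the reference."""
    out_tokens = normalize(output).split()
    ref_tokens = normalize(reference).split()
    if not out_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(out_tokens == ref_tokens)
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `exact_match("  Paris. ", "paris")` passes after normalization, while `token_f1("the cat sat", "the cat sat down")` gives partial credit because three of the four reference tokens are present.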

The reference-free path

Here there is no stored answer, so you replace the reference with an explicit definition of good. Usually a judge — human or LLM — reads the input and the output and scores it against named criteria: faithfulness to the source, helpfulness, relevance, tone, format, safety. The criteria are the standard; without them, "quality" is undefined and the score is noise.

A reference-free judge prompt has to carry weight the reference would otherwise carry. Compare the two prompt shapes:

Reference-based vs reference-free judge prompts (text):
# REFERENCE-BASED — the gold answer does the heavy lifting
Question: {question}
Gold answer: {reference}
Model answer: {output}
Is the model answer consistent with the gold answer? yes / no

# REFERENCE-FREE — the rubric must define every standard
Question: {question}
Model answer: {output}
Score the answer 1-5 on EACH criterion, with a one-line reason:
- Faithful: only claims supported by the question/context
- Relevant: actually answers what was asked
- Clear: a non-expert could follow it

Side by side: which fits your task

The deciding question is almost always: does a single correct answer exist? If yes, lean reference-based. If the task is open-ended with many valid outputs, you usually have no choice but reference-free.

| | Reference-based | Reference-free |
|---|---|---|
| Needs a gold answer | Yes, a curated dataset | No |
| Best for | Closed tasks: QA, math, classification, code, extraction | Open tasks: summaries, chat, rewrites, creative, style |
| Typical grader | Exact/code check or similarity metric | LLM-as-judge or human rater |
| Cost per item | Often free / very cheap | A model call per output |
| Reproducible | High: same answer, same score | Lower: judge opinion can drift |
| Main weakness | Penalizes valid alternative wordings | Judge bias; harder to trust |
| Update cost | Re-label data when truth changes | Re-tune the rubric / prompt |

A subtle middle ground: reference-guided grading. It is reference-based in spirit — a gold answer exists — but uses an LLM judge instead of exact match so that different-but-equivalent answers still pass. It buys flexibility while keeping the anchor, at the price of a model call. Many production eval suites live here.
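
A rough sketch of reference-guided grading, assuming the judge is any callable that takes a prompt string and returns text. The names `reference_guided_grade` and `fake_judge` are hypothetical, and `fake_judge` is only a stand-in so the example runs without a model; a real judge would be an LLM call. The exact-match shortcut is a design choice of this sketch, not something the approach requires.

```python
from typing import Callable

PROMPT = """Question: {question}
Gold answer: {reference}
Model answer: {output}
Is the model answer consistent with the gold answer? Reply with exactly "yes" or "no"."""


def reference_guided_grade(
    question: str,
    reference: str,
    output: str,
    judge: Callable[[str], str],
) -> bool:
    # Cheap path first: an exact match needs no model call.
    if output.strip().lower() == reference.strip().lower():
        return True
    verdict = judge(PROMPT.format(question=question, reference=reference, output=output))
    return verdict.strip().lower().startswith("yes")


def fake_judge(prompt: str) -> str:
    """Offline stand-in for an LLM judge (assumption: says yes if it sees 'thirty')."""
    return "yes" if "thirty" in prompt.lower() else "no"


# A differently worded but equivalent answer passes; a wrong one does not.
ok = reference_guided_grade("How many days?", "30 days", "You get thirty days.", fake_judge)
bad = reference_guided_grade("How many days?", "30 days", "You get 14 days.", fake_judge)
```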

A worked example

Say you are evaluating a model that answers customer questions from a help center. One test case has two layers, and each layer wants a different family.

Layer 1 — the fact (reference-based). Question: "How many days do I have to return a physical item?" The policy says 30. There is exactly one right answer, so you store 30 as the reference and check it programmatically. No judge needed, no ambiguity.

Reference-based check, cheap and exact (python):
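import re

def grade_fact(output: str, reference: str) -> bool:
    # Truth is known, so no model call. Match the reference as a whole
    # token so that "30" does not also pass on "130" or "300".
    pattern = rf"(?<!\w){re.escape(reference.strip().lower())}(?!\w)"
    return re.search(pattern, output.lower()) is not None

assert grade_fact("You have 30 days to return it.", "30")  # passes
assert not grade_fact("You have 130 days to return it.", "30")  # no partial match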

Layer 2 — the explanation (reference-free). Was the reply clear, polite, and did it avoid inventing a policy that isn't in the docs? There is no single "correct" sentence, so you hand the input and output to an LLM judge with a rubric and let it score.

Reference-free check, judged against a rubric (python):
JUDGE = '''You are grading a support reply. Score 1-5 on EACH, with a reason.
- Faithful: every claim is supported by the provided policy text
- Helpful: directly answers the customer's question
- Tone: polite and professional
Return JSON: {"faithful": n, "helpful": n, "tone": n, "reason": "..."}'''

def grade_quality(question, policy, output, judge):
    prompt = f"{JUDGE}\n\nPolicy:\n{policy}\n\nQuestion: {question}\nReply: {output}"
    return judge(prompt)  # an LLM call -> parsed scores

Same test case, two graders. The fact is cheap and certain; the quality is fuzzy and needs judgment. Mixing them — exact check for what is knowable, rubric judge for what is not — is the normal shape of a real eval suite.
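
One way to combine the two layers into a single verdict is sketched below. It assumes the judge replies with the JSON shape requested in the rubric prompt above; the helper names (`parse_scores`, `grade_case`) and the pass threshold of 4 are illustrative choices, not part of any standard.

```python
import json

CRITERIA = ("faithful", "helpful", "tone")


def parse_scores(raw: str) -> dict:
    """Parse the judge's JSON reply and reject anything outside the 1-5 rubric."""
    scores = json.loads(raw)
    if not isinstance(scores, dict):
        raise ValueError("judge reply is not a JSON object")
    for name in CRITERIA:
        value = scores.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"bad score for {name!r}: {value!r}")
    return scores


def grade_case(fact_ok: bool, raw_judge_reply: str, min_score: int = 4) -> dict:
    """Merge the exact fact check with the rubric scores (threshold is an assumption)."""
    scores = parse_scores(raw_judge_reply)
    quality_ok = all(scores[name] >= min_score for name in CRITERIA)
    return {
        "fact_ok": fact_ok,
        "quality_ok": quality_ok,
        "passed": fact_ok and quality_ok,
        "scores": scores,
    }
```

Keeping `fact_ok` and `quality_ok` as separate fields means that when the overall pass rate moves, you can tell which family of check moved it.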

Common pitfalls

Most evaluation mistakes are really category mistakes — using the wrong family for the task, or trusting a reference-free score as if it were a reference-based one.

  • A single gold answer for an open task. Marking a summary wrong because it didn't match your one reference phrasing punishes good outputs. If many answers are valid, either store several references or switch to reference-free.
  • Treating a reference-free score as ground truth. It is the judge's opinion, not a fact. Before you trust it, validate the judge against human labels and watch for systematic biases like length, position, or self-preference.
  • Stale references. A gold dataset rots. When the policy, API, or correct answer changes and the reference doesn't, reference-based evals confidently mark the new correct answer as wrong.
  • Vague rubrics. "Rate the quality 1-10" with no definition produces noise. Reference-free only works when each criterion is concrete enough that two judges would agree.
  • Same model grades itself. A reference-free judge that is the same model (or from the same family) as the one you are testing can inflate the scores it gives its own outputs. A reference keeps that bias in check; without one, use a separate, strong judge model.
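
For the first pitfall, storing several references can be as simple as scoring against each one and keeping the best result. The sketch below assumes a pluggable scoring function; `best_reference_score` and `_same` are hypothetical names, and the default comparison is a plain case-insensitive match.

```python
from typing import Callable, Iterable


def _same(output: str, reference: str) -> float:
    """Default scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return float(output.strip().lower() == reference.strip().lower())


def best_reference_score(
    output: str,
    references: Iterable[str],
    score: Callable[[str, str], float] = _same,
) -> float:
    """Score the output against every accepted reference and keep the best match."""
    return max((score(output, ref) for ref in references), default=0.0)
```

Any similarity function with the same two-argument shape can be passed as `score`, so the same wrapper works for exact match or for a softer overlap metric.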

Going deeper

Once the two families click, the interesting work is in combining and stress-testing them.

The spectrum, not a binary. Real evals run from fully reference-based (exact match) → reference-guided (gold answer + LLM judge) → fully reference-free (rubric only). Moving right buys flexibility and coverage and costs you reproducibility and trust. Pick the leftmost point on that line that still captures what you care about — never go more reference-free than the task forces you to.

Reference-free meets pairwise. A clean way to dodge the "no gold answer" problem without a rubric is to compare two outputs and ask which is better. That is still reference-free (no gold answer), but humans and judges find relative judgments easier and more stable than absolute scores — see pairwise vs rubric judging. Large-scale human pairwise voting is exactly how Chatbot Arena ranks models.
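
A minimal pairwise sketch, assuming a judge callable that returns "1" or "2". Asking twice with the order swapped is one common way to damp position bias; the function name, prompt wording, and tie rule here are assumptions for illustration.

```python
from typing import Callable


def pairwise_winner(question: str, a: str, b: str, judge: Callable[[str], str]) -> str:
    """Return "a", "b", or "tie" after asking the judge in both orders."""

    def ask(first: str, second: str) -> str:
        prompt = (
            f"Question: {question}\n\n"
            f"Response 1:\n{first}\n\n"
            f"Response 2:\n{second}\n\n"
            "Which response is better? Reply with exactly 1 or 2."
        )
        return judge(prompt).strip()

    forward = ask(a, b)   # a is shown first
    backward = ask(b, a)  # b is shown first
    if forward == "1" and backward == "2":
        return "a"
    if forward == "2" and backward == "1":
        return "b"
    # The judge contradicted itself across orderings (or replied off-format).
    return "tie"
```

A judge that always picks whichever response comes first produces a tie under this rule, which is the intended behavior: a preference that flips with the ordering is not evidence about quality.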

Faithfulness as a hybrid. In retrieval systems, a common check is groundedness: does every claim in the answer trace back to the retrieved context? There is no gold answer, but the retrieved documents act as a constraint the judge checks against. The check is reference-free with respect to the final answer yet anchored by source text, a practical middle path that gives the judge something concrete to verify instead of relying on impression alone.
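
A groundedness check can be sketched as a claim-by-claim loop. This version makes two loud simplifications: it treats each sentence as one claim using a naive regex split, and it assumes a judge callable that answers yes or no. The name `ungrounded_claims` is hypothetical.

```python
import re
from typing import Callable, List


def ungrounded_claims(answer: str, context: str, judge: Callable[[str], str]) -> List[str]:
    """Return the sentences of `answer` that the judge says the context does not support."""
    # Naive sentence split (assumption): break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    flagged = []
    for sentence in sentences:
        prompt = (
            f"Context:\n{context}\n\n"
            f"Claim: {sentence}\n"
            "Is the claim fully supported by the context? Reply yes or no."
        )
        if not judge(prompt).strip().lower().startswith("yes"):
            flagged.append(sentence)
    return flagged
```

Returning the flagged sentences, not just a score, makes the result auditable: a reviewer can read each flagged claim next to the retrieved text.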

Validate your judge like a model. A reference-free judge is itself a system that can be right or wrong. Build a small set of items you have labeled by hand, then measure how often the judge agrees with you. If agreement is low, fix the rubric, swap the judge model, or fall back to reference-based where you can. The durable rule: a reference-free score is only as trustworthy as the validation you did on the judge that produced it.
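
Measuring judge agreement needs nothing more than paired labels. The sketch below computes raw agreement and, as an optional extra not mentioned above, Cohen's kappa, which corrects for agreement expected by chance. The function names are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Fraction of items where the judge's label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "fail", "fail", "fail"]
rate = agreement_rate(human_labels, judge_labels)   # 0.75
kappa = cohens_kappa(human_labels, judge_labels)    # 0.5
```

Raw agreement can look high when one label dominates, which is why a chance-corrected number is worth reporting alongside it.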

FAQ

What is the difference between reference-based and reference-free evaluation?

Reference-based evaluation compares the model's output to a known-correct answer (a gold or reference answer) and scores how close it lands. Reference-free evaluation has no stored correct answer; instead it judges the output's quality against explicit criteria like accuracy, relevance, and clarity. In short: reference-based asks "does it match the answer?"; reference-free asks "is it good?"

When should I use reference-free evaluation?

Use it for open-ended tasks where no single correct answer exists — summaries, chat replies, rewrites, creative writing, or brainstorming. For closed tasks with one right answer (math, classification, code, factual QA) prefer reference-based grading, which is cheaper and more reproducible.

Can you do LLM-as-a-judge without a reference answer?

Yes — that is exactly reference-free judging. You give the judge the input and output plus a rubric of named criteria, and it scores each one. The catch is that all the standards now live in the prompt rather than in a gold answer, so the rubric must be concrete, and you should validate the judge against human labels before trusting it.

Why is reference-free evaluation considered less trustworthy?

Because the score reflects a judge's opinion instead of a verifiable fact. LLM judges can carry biases — favoring longer answers, the first option, or their own outputs — and a fluent, confident, wrong answer can score well. A reference answer anchors the grade to ground truth; without one, you need extra validation to believe the numbers.

What is reference-guided evaluation?

It is a middle ground: a gold answer exists, but instead of exact matching you give the judge both the output and the reference and ask whether they are consistent. This allows different-but-equivalent phrasings to pass while still anchoring the grade to a known-correct answer. It costs a model call but is more robust than strict matching.

Is exact match a reference-based or reference-free method?

Exact match is reference-based — it compares the output character-for-character (or after light normalization) to a stored gold answer. It is the cheapest, most reproducible grader but only works when one canonical answer exists, such as labels, IDs, or multiple-choice letters.

Further reading