What Is lm-evaluation-harness? Benchmark Runner

You will understand why lm-evaluation-harness is the de facto standard for running academic LLM benchmarks reproducibly.

Intermediate · 9 min read · Updated 2026-06-14

In plain English

lm-evaluation-harness is an open-source tool from EleutherAI that runs academic benchmarks on a language model for you. You point it at a model and a list of benchmark tasks — say a multiple-choice knowledge test or a reading-comprehension set — and it asks every question, collects the model's answers, scores them, and prints a number. Nothing is hand-wired; the framework already knows how each benchmark is supposed to be run.

Think of it as a standardized exam center. Anyone can claim their student is smart, but a claim only means something if everyone sat the same paper, under the same time limit, marked by the same rubric. The harness is that exam center for AI models: it owns the question papers, the seating rules, and the marking scheme, so that a score from one lab can be compared honestly against a score from another.

The reason this matters is subtle. A benchmark is not just a pile of questions — it is also a hundred tiny decisions about how you ask them: the exact wording of the prompt, how many worked examples you show first, whether you read the answer from the model's text or from its internal probabilities, and how you decide a response is "correct." Change any of those and the score moves. The harness freezes all of it into reusable code so the same task runs the same way every time, for every model.

Why it matters

Before a shared harness existed, every research group wrote its own evaluation code. The benchmark names matched, but the implementations quietly didn't, so two scores reported under the same benchmark name were often not measuring the same thing.

Reproducibility. A score is only trustworthy if someone else can re-run it and get the same answer. A shared, versioned harness lets a third party reproduce a reported number instead of trusting a screenshot.
Apples-to-apples comparison. When two models are run through the identical prompt template, few-shot setup, and scoring logic, a gap between their scores reflects the models — not a difference in how each team happened to phrase the question.
No silent prompt advantage. It is easy, even by accident, to write a prompt that flatters your own model. Pinning the prompt in shared code removes that thumb on the scale.
Coverage without reinventing the wheel. The harness ships hundreds of ready-made task definitions, so you can evaluate on a well-known benchmark in one command rather than re-implementing its quirks from a paper.

Who cares about this? Model builders who need credible numbers in a release. Researchers comparing a new fine-tune against a baseline. Engineers choosing a base model and wanting to sanity-check the leaderboard claims themselves. And anyone who has been burned by two "MMLU scores" that turned out to be measured completely differently.

It is worth being clear about what the harness is not. It runs academic benchmarks — fixed, public question sets that probe general capability. It is not a tool for testing your product against your data; for that you want application-level LLM evals. The two are complementary: benchmarks tell you how capable a model is in general, evals tell you whether your specific app actually works.

How it works

Under the hood the harness is a loop over a benchmark's questions, wrapped in code that handles the fiddly parts consistently. A run has four moving pieces: the model (what you are testing), the task (a benchmark definition), the request type (how the model is queried), and the metric (how answers become a score).

// One harness run, end to end

Task (benchmark + prompt template) → Build prompts (add few-shot examples) → Query model (loglikelihood or generate) → Score (compare to gold answer) → Aggregate (metric per task)
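
The stages in the flow above can be sketched as a short loop. This is a toy illustration, not the harness's own code: the dataset rows, `build_prompt`, `toy_model`, and `run_task` are all made-up names, and the "model" is a lookup table standing in for a real backend.

```python
from statistics import mean

# Hypothetical toy task: two rows, each with a question and a gold answer.
DOCS = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]


def build_prompt(doc):
    """Turn a raw dataset row into the text shown to the model."""
    return f"Question: {doc['question']}\nAnswer:"


def toy_model(prompt):
    """Stand-in for a real model: canned answers keyed on the prompt."""
    canned = {
        "Question: 2 + 2 = ?\nAnswer:": "4",
        "Question: Capital of France?\nAnswer:": "Rome",  # deliberately wrong
    }
    return canned[prompt]


def run_task(docs, model):
    """Build prompt -> query model -> score -> aggregate."""
    per_doc_scores = []
    for doc in docs:
        prompt = build_prompt(doc)                  # build the prompt
        output = model(prompt)                      # query the model
        per_doc_scores.append(float(output == doc["answer"]))  # score
    return mean(per_doc_scores)                     # aggregate into one metric


print(run_task(DOCS, toy_model))  # 0.5: one of the two answers is right
```

The point of the sketch is the separation of stages: swapping the model or the scoring rule touches one function and leaves the rest of the loop unchanged.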

Tasks: the frozen recipe for a benchmark

Each benchmark is described by a task — usually a small YAML file plus the dataset. The task pins down everything that would otherwise drift: which dataset and split to load, the exact prompt template that turns a raw row into a question, how many few-shot examples to prepend, and which metric to compute. Because the recipe lives in shared code, "run benchmark X" means the same thing for everyone.
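
To make "frozen recipe" concrete, here is a minimal sketch of what a task pins down, written as a Python dataclass rather than the harness's actual YAML. The class name `TaskRecipe`, its field names, and the dataset id are assumptions chosen for illustration; the real task files use their own keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the recipe cannot be mutated after creation
class TaskRecipe:
    """Hypothetical stand-in for a task definition (illustrative fields only)."""
    name: str
    dataset: str
    split: str
    prompt_template: str
    num_fewshot: int
    metric: str
    version: int

    def render(self, row: dict) -> str:
        """Turn one raw dataset row into the prompt text."""
        return self.prompt_template.format(**row)


recipe = TaskRecipe(
    name="toy_qa",
    dataset="example/toy-qa",  # made-up dataset id
    split="test",
    prompt_template="Question: {question}\nAnswer:",
    num_fewshot=5,
    metric="exact_match",
    version=1,
)

print(recipe.render({"question": "What is 2 + 2?"}))
```

Everything that could drift between two evaluation scripts lives in one immutable object, which is the property the shared task files are meant to provide.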

Two ways to ask: scoring choices vs free generation

A key idea is that there is more than one way to "ask" a model a question, and the harness supports the two that matter. For a multiple-choice question it usually does not let the model ramble. Instead it measures the log-likelihood of each candidate answer (the log of the probability the model assigns to that answer's text, given the prompt) and picks the highest. That is robust: the model can't lose a correct answer just by formatting it oddly. For open-ended tasks it instead lets the model generate text, then checks that text against the gold answer (exact match, a normalized comparison, or a task-specific check).

// Two request types, two scoring styles

Loglikelihood (multiple-choice)

Score each candidate answer
Pick the highest-probability option
No parsing of free text needed
Robust to formatting quirks

Generate (open-ended)

Model writes a free-text answer
Compare against the gold answer
Exact / normalized / task check
Sensitive to how output is parsed
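
The two scoring styles can be sketched in a few lines. This is an illustrative approximation, not the harness's implementation: the function names are hypothetical, the log-likelihood values are invented, and the normalization shown (lowercase, strip punctuation, collapse whitespace) is just one plausible choice.

```python
import re
import string


def pick_by_loglikelihood(loglikelihoods):
    """Return the index of the candidate with the highest log-likelihood."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(generated, gold, normalized=True):
    """Compare generated text to the gold answer, optionally after normalizing."""
    if normalized:
        return normalize(generated) == normalize(gold)
    return generated == gold


# Multiple-choice: invented log-likelihoods for options A, B, C, D.
choice_scores = [-4.2, -0.7, -3.1, -5.0]
print(pick_by_loglikelihood(choice_scores))  # 1, i.e. option B

# Open-ended: the same output passes or fails depending on how it is parsed.
print(exact_match("  The answer is Paris. ", "the answer is paris"))  # True
print(exact_match("Paris.", "Paris", normalized=False))               # False
```

The last two lines show why generation-based scoring is sensitive to parsing: a trailing period is enough to flip a strict exact match.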

Few-shot and the prompt template

Many benchmarks are run few-shot: before the real question, the harness prepends a fixed number of solved examples so the model learns the answer format from context. The count (often written as num_fewshot) is part of the task recipe, because a model's score at 0-shot and 5-shot can differ a lot. Pinning it is exactly why two harness runs are comparable and two ad-hoc scripts often are not.
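
A minimal sketch of few-shot prompt construction follows. It assumes a simple "Question / Answer" template and takes the first `num_fewshot` examples from a pool; both are simplifications for illustration, and the function names are hypothetical rather than anything the harness exposes.

```python
def format_example(question, answer=None):
    """Render one example; leave the answer blank for the real question."""
    text = f"Question: {question}\nAnswer:"
    return text if answer is None else f"{text} {answer}"


def build_fewshot_prompt(fewshot_pool, question, num_fewshot):
    """Prepend the first `num_fewshot` solved examples to the real question."""
    if num_fewshot > len(fewshot_pool):
        raise ValueError("not enough solved examples for the requested shot count")
    shots = [format_example(q, a) for q, a in fewshot_pool[:num_fewshot]]
    return "\n\n".join(shots + [format_example(question)])


# Made-up pool of solved examples.
POOL = [("1 + 1 = ?", "2"), ("3 + 2 = ?", "5"), ("4 + 4 = ?", "8")]

print(build_fewshot_prompt(POOL, "2 + 2 = ?", num_fewshot=2))
```

With `num_fewshot=0` the same function returns only the bare question, so the model sees a different prompt and can plausibly produce a different score.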

The shape of a typical run (bash):

# Evaluate a model on a couple of tasks, 5 worked examples each.
lm_eval \
  --model hf \
  --model_args pretrained=<your-model> \
  --tasks task_a,task_b \
  --num_fewshot 5 \
  --batch_size auto

The --model flag selects a pluggable backend: a local Hugging Face model, an inference server, or a hosted API all expose the same interface to the harness, so the same task definition can score any of them. That separation, with tasks on one side and model backends on the other, is what lets the harness compare wildly different models on identical questions.
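
The backend separation can be sketched with a small interface. Everything here is hypothetical: `ModelBackend`, its method names, and the two toy backends are invented for illustration and are not the harness's actual classes.

```python
from typing import Protocol


class ModelBackend(Protocol):
    """Hypothetical interface that every backend is assumed to implement."""

    def loglikelihood(self, context: str, continuation: str) -> float: ...

    def generate(self, context: str) -> str: ...


class CannedBackend:
    """Toy backend that 'knows' a fixed answer for each context."""

    def __init__(self, answers):
        self.answers = answers

    def loglikelihood(self, context, continuation):
        return 0.0 if self.answers.get(context) == continuation else -10.0

    def generate(self, context):
        return self.answers.get(context, "")


class ShortestBackend:
    """Toy backend that always prefers the shortest continuation."""

    def loglikelihood(self, context, continuation):
        return -float(len(continuation))

    def generate(self, context):
        return ""


def multiple_choice_correct(backend: ModelBackend, context, choices, gold):
    """Task-side scoring code: identical no matter which backend is plugged in."""
    scores = [backend.loglikelihood(context, choice) for choice in choices]
    return scores.index(max(scores)) == gold


question = "Is water wet? Answer:"
for backend in (CannedBackend({question: "yes"}), ShortestBackend()):
    print(type(backend).__name__,
          multiple_choice_correct(backend, question, ["yes", "no"], gold=0))
```

The scoring function never inspects what kind of backend it received, which is the property that lets one task definition evaluate very different models.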

What the harness actually fixes

It helps to see the difference between an ad-hoc evaluation script and running the same benchmark through the harness. The questions are the same; the discipline is not.

Decision	Hand-rolled script	lm-evaluation-harness
Prompt wording	Whatever the author typed	Fixed in the shared task
Few-shot count	Often unstated	Declared (`num_fewshot`)
Reading the answer	Custom regex per script	Standard scoring per request type
Metric definition	Re-implemented from the paper	One canonical implementation
Reproducible by others	Rarely	By design, with a version pin

When to reach for it (and when not to)

Good fits

Comparing base models before you build on one — run the same standard tasks across candidates and read the gap.
Checking a fine-tune for regressions on general capability after you train it on your own data.
Reproducing a published number to confirm a release's leaderboard claim instead of trusting it.
Reporting credible benchmark results for a model you are releasing, so reviewers can re-run them.

Poor fits

Testing your product's behavior on your real prompts and documents — that is application-level evaluation, where you write your own eval suite.
Judging open-ended quality like tone or helpfulness, where a fixed gold answer doesn't exist and an LLM-as-a-judge or human review fits better.
Measuring latency, cost, or production reliability — the harness scores correctness, not operations.

Going deeper

Once the basics click, a few realities of benchmark-running are worth knowing — they explain a lot of confusing leaderboard arguments.

Versioning is not a footnote. Tasks in the harness are versioned, and a fix to a prompt or a dataset can shift scores between versions. A responsible result cites the harness version and the task version, not just "MMLU = X." Treat any score without that context as approximate.
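
One way to act on this is to store the recipe next to every score and refuse to compare results whose recipes differ. The sketch below is an assumption about how a team might do that; `ScoreRecord`, its fields, and all the values are invented for illustration and do not mirror the harness's output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical record that keeps the recipe alongside the score."""
    model: str
    task: str
    score: float
    harness_version: str
    task_version: int
    num_fewshot: int


def comparable(a, b):
    """Two scores are comparable only if the recipe behind them matches."""
    recipe_a = (a.task, a.harness_version, a.task_version, a.num_fewshot)
    recipe_b = (b.task, b.harness_version, b.task_version, b.num_fewshot)
    return recipe_a == recipe_b


# Made-up results for illustration.
run_a = ScoreRecord("model-a", "toy_qa", 0.71, "0.0.1", 1, 5)
run_b = ScoreRecord("model-b", "toy_qa", 0.68, "0.0.1", 1, 5)
run_c = ScoreRecord("model-b", "toy_qa", 0.74, "0.0.1", 2, 5)  # newer task version

print(comparable(run_a, run_b))  # True: same task, versions, and shot count
print(comparable(run_a, run_c))  # False: the task version changed
```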

Loglikelihood scoring has limits. Picking the highest-probability multiple-choice option is robust, but it measures something narrower than "can the model actually solve this." A model might rank the right option highest yet fail to produce a clean answer when generating freely. For reasoning-heavy tasks, generation-based scoring (often with chain-of-thought) tells a fuller story — at the cost of being more sensitive to how you parse the output.

Contamination is the quiet killer. Public benchmarks leak into training data, so a high score can reflect memorization rather than skill — the model may have seen the test. The harness runs the questions faithfully; it cannot tell you whether the model studied the answer key in advance. That is why fresh, private, or held-out evaluations matter alongside public benchmarks.

Custom tasks are the real power. Because a task is just a config plus a dataset, you can add your own benchmark in the same framework and get the harness's consistent prompting and scoring for free. That is the bridge from running standard academic tests to building a reproducible, in-house evaluation that still benefits from a battle-tested runner.

Where to go next: learn how application-level LLM evals differ from academic benchmarks, how a golden dataset anchors your own tests, and how the line between code-graded and model-graded scoring shapes what you can measure. The durable lesson from the harness is simple: a benchmark number is meaningless without the recipe that produced it — so always ship the recipe with the score.

FAQ

What is lm-evaluation-harness used for?

It runs standardized academic benchmarks on language models in a consistent, reproducible way. You give it a model and a list of benchmark tasks, and it builds the prompts, queries the model, scores the answers, and reports a number — using the same prompt template, few-shot setup, and scoring rules every time so results are comparable across models.

Who makes lm-evaluation-harness?

It is an open-source project from EleutherAI, a non-profit AI research group. Because it is open and shared, it has become the de facto reference runner behind many public LLM leaderboards and model-release benchmark numbers.

Why do two MMLU scores for the same model sometimes differ?

Because a benchmark score depends on more than the questions: the prompt wording, the number of few-shot examples, how the answer is read from the model, and the metric all affect it. Two scores only match when those settings — and the harness/task version — match. A shared harness exists precisely to remove that ambiguity.

Is lm-evaluation-harness the same as a leaderboard?

No. The harness is the engine that computes scores; a leaderboard is a published ranking of those scores. Many leaderboards run models through the harness, but the harness itself just produces numbers — it does not host or rank them.

Can I use it to test my own LLM application?

It is built for fixed academic benchmarks, not your product's real prompts and data. You can add a custom task to evaluate your own dataset within the framework, but for product behavior most teams write an application-level eval suite instead. Use the harness for general capability and your own evals for app fitness.

What is few-shot evaluation in the harness?

Few-shot means prepending a fixed number of solved examples before the real question so the model learns the answer format from context. The count is part of each task's recipe, because the same model can score very differently at 0-shot versus 5-shot — pinning it is what makes runs comparable.
