In plain English
lm-evaluation-harness is an open-source tool from EleutherAI that runs academic benchmarks on a language model for you. You point it at a model and a list of benchmark tasks — say a multiple-choice knowledge test or a reading-comprehension set — and it asks every question, collects the model's answers, scores them, and prints a number. Nothing is hand-wired; the framework already knows how each benchmark is supposed to be run.

Think of it as a standardized exam center. Anyone can claim their student is smart, but a claim only means something if everyone sat the same paper, under the same time limit, marked by the same rubric. The harness is that exam center for AI models: it owns the question papers, the seating rules, and the marking scheme, so that a score from one lab can be compared honestly against a score from another.
The reason this matters is subtle. A benchmark is not just a pile of questions — it is also a hundred tiny decisions about how you ask them: the exact wording of the prompt, how many worked examples you show first, whether you read the answer from the model's text or from its internal probabilities, and how you decide a response is "correct." Change any of those and the score moves. The harness freezes all of it into reusable code so the same task runs the same way every time, for every model.
Why it matters
Before a shared harness existed, every research group wrote its own evaluation code. The benchmark names matched, but the implementations quietly didn't — and that one fact poisoned a lot of model comparisons.
- Reproducibility. A score is only trustworthy if someone else can re-run it and get the same answer. A shared, versioned harness lets a third party reproduce a reported number instead of trusting a screenshot.
- Apples-to-apples comparison. When two models are run through the identical prompt template, few-shot setup, and scoring logic, a gap between their scores reflects the models — not a difference in how each team happened to phrase the question.
- No silent prompt advantage. It is easy, even by accident, to write a prompt that flatters your own model. Pinning the prompt in shared code removes that thumb on the scale.
- Coverage without reinventing the wheel. The harness ships hundreds of ready-made task definitions, so you can evaluate on a well-known benchmark in one command rather than re-implementing its quirks from a paper.
Who cares about this? Model builders who need credible numbers in a release. Researchers comparing a new fine-tune against a baseline. Engineers choosing a base model and wanting to sanity-check the leaderboard claims themselves. And anyone who has been burned by two "MMLU scores" that turned out to be measured completely differently.
It is worth being clear about what the harness is not. It runs academic benchmarks — fixed, public question sets that probe general capability. It is not a tool for testing your product against your data; for that you want application-level LLM evals. The two are complementary: benchmarks tell you how capable a model is in general, evals tell you whether your specific app actually works.
How it works
Under the hood the harness is a loop over a benchmark's questions, wrapped in code that handles the fiddly parts consistently. A run has four moving pieces: the model (what you are testing), the task (a benchmark definition), the request type (how the model is queried), and the metric (how answers become a score).
Tasks: the frozen recipe for a benchmark
Each benchmark is described by a task — usually a small YAML file plus the dataset. The task pins down everything that would otherwise drift: which dataset and split to load, the exact prompt template that turns a raw row into a question, how many few-shot examples to prepend, and which metric to compute. Because the recipe lives in shared code, "run benchmark X" means the same thing for everyone.
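To make the idea of a frozen recipe concrete, here is a minimal Python sketch. It is not the harness's actual task schema: the `TaskRecipe` class, its field names, and the dataset name are all hypothetical, chosen only to mirror the items listed above (dataset, split, prompt template, few-shot count, metric).

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the recipe cannot be changed after creation
class TaskRecipe:
    """Hypothetical stand-in for a task definition; field names are illustrative."""

    dataset: str          # which dataset to load
    split: str            # which split to evaluate on
    prompt_template: str  # turns a raw row into a question
    num_fewshot: int      # how many solved examples to prepend
    metric: str           # how answers become a score

    def render(self, row: dict) -> str:
        # Fill the template from one dataset row.
        return self.prompt_template.format(**row)


recipe = TaskRecipe(
    dataset="example/trivia",  # made-up dataset name
    split="test",
    prompt_template="Question: {question}\nAnswer:",
    num_fewshot=5,
    metric="exact_match",
)

prompt = recipe.render({"question": "What is the capital of France?"})
print(prompt)
```

Because every run reads the same recipe object, two people rendering the same row get the same prompt, which is the property the shared task files are meant to provide.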
Two ways to ask: scoring choices vs free generation
A key idea is that there is more than one way to "ask" a model a question, and the harness supports the two that matter most for benchmark scores. For a multiple-choice question it usually does not let the model ramble. Instead it measures the log-likelihood the model assigns to each candidate answer (the log of the probability of that answer text given the prompt) and picks the highest. That is robust: the model can't lose a correct answer just by formatting it oddly. For open-ended tasks it instead lets the model generate text, then checks that text against the gold answer (exact match, a normalized comparison, or a task-specific check).
Scoring choices (log-likelihood):
- Score each candidate answer
- Pick the highest-probability option
- No parsing of free text needed
- Robust to formatting quirks

Free generation:
- Model writes a free-text answer
- Compare against the gold answer
- Exact / normalized / task check
- Sensitive to how output is parsed
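The two request styles can be sketched side by side in Python. This is an illustration under stated assumptions, not the harness's code: `toy_loglikelihood` and its hard-coded probabilities stand in for a real model, and `pick_choice`, `normalize`, and `exact_match` are hypothetical helper names.

```python
import math

# Made-up probabilities standing in for a real model's scores.
TOY_PROBS = {" Paris": 0.7, " London": 0.2, " Rome": 0.1}


def toy_loglikelihood(context: str, continuation: str) -> float:
    # A real backend would score the continuation given the context;
    # this toy version ignores the context and looks up a fixed value.
    return math.log(TOY_PROBS[continuation])


def pick_choice(context, choices, loglikelihood):
    # Score every candidate and return the index of the most likely one.
    scores = [loglikelihood(context, choice) for choice in choices]
    best = max(range(len(choices)), key=scores.__getitem__)
    return best, scores


def normalize(text: str) -> str:
    # Lowercase, trim, drop a trailing period, collapse whitespace.
    return " ".join(text.lower().strip().rstrip(".").split())


def exact_match(generated: str, gold: str) -> bool:
    return normalize(generated) == normalize(gold)


choices = [" Paris", " London", " Rome"]
best, scores = pick_choice("The capital of France is", choices, toy_loglikelihood)
print("chosen:", choices[best])

# Generation is graded on the text itself, so phrasing matters.
print(exact_match("  Paris.", "paris"))
print(exact_match("The answer is Paris", "Paris"))
```

The last call shows the trade-off in the list above: a correct but wordy generation fails a normalized exact match, while the choice-scoring path never has to parse text at all.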
Few-shot and the prompt template
Many benchmarks are run few-shot: before the real question, the harness prepends a fixed number of solved examples so the model learns the answer format from context. The count (often written as num_fewshot) is part of the task recipe, because a model's score at 0-shot and 5-shot can differ a lot. Pinning it is exactly why two harness runs are comparable and two ad-hoc scripts often are not.
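A rough Python sketch of the prepending step follows. It is a simplification with hypothetical names: `build_fewshot_prompt` takes the first `num_fewshot` examples and uses a made-up "Question/Answer" layout, whereas a real task defines its own template and its own way of choosing examples.

```python
def build_fewshot_prompt(examples, question, num_fewshot):
    """Prepend num_fewshot solved examples, then pose the real question."""
    shots = examples[:num_fewshot]  # simplification: take the first N
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    parts.append(f"Question: {question}\nAnswer:")  # left open for the model
    return "\n\n".join(parts)


SOLVED = [("1+1?", "2"), ("2+3?", "5"), ("4+4?", "8")]

zero_shot = build_fewshot_prompt(SOLVED, "2+2?", num_fewshot=0)
two_shot = build_fewshot_prompt(SOLVED, "2+2?", num_fewshot=2)
print(two_shot)
```

The model sees a visibly different input at 0-shot and at 2-shot, which is why the count has to be stated alongside any score.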
```bash
# Evaluate a model on a couple of tasks, 5 worked examples each.
lm_eval \
  --model hf \
  --model_args pretrained=<your-model> \
  --tasks task_a,task_b \
  --num_fewshot 5 \
  --batch_size auto
```

The --model flag is a pluggable backend: a local Hugging Face model, an inference server, or a hosted API all expose the same interface to the harness, so the same task definition can score any of them. That separation (tasks on one side, model backends on the other) is what lets the harness compare wildly different models on identical questions.
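The separation between tasks and backends can be illustrated with a small Python sketch. The `Backend` protocol, the two toy models, and the `accuracy` function are hypothetical and far simpler than the harness's real model interface; the point is only that the scoring code never needs to know which backend it is talking to.

```python
from typing import Protocol


class Backend(Protocol):
    """Hypothetical minimal interface; the real harness API is richer."""

    def generate(self, prompt: str) -> str: ...


class AlwaysFour:
    # Toy backend that answers "4" to everything.
    def generate(self, prompt: str) -> str:
        return "4"


class LookupModel:
    # Toy backend that answers from a fixed table.
    def __init__(self, answers: dict[str, str]) -> None:
        self.answers = answers

    def generate(self, prompt: str) -> str:
        return self.answers.get(prompt, "")


def accuracy(backend: Backend, items: list[tuple[str, str]]) -> float:
    # The same questions and the same grading rule for every backend.
    hits = sum(backend.generate(q).strip() == gold for q, gold in items)
    return hits / len(items)


ITEMS = [("2 + 2 =", "4"), ("3 + 3 =", "6")]
score_a = accuracy(AlwaysFour(), ITEMS)
score_b = accuracy(LookupModel(dict(ITEMS)), ITEMS)
print(score_a, score_b)
```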
What the harness actually fixes
It helps to see the difference between an ad-hoc evaluation script and running the same benchmark through the harness. The questions are the same; the discipline is not.
| Decision | Hand-rolled script | lm-evaluation-harness |
|---|---|---|
| Prompt wording | Whatever the author typed | Fixed in the shared task |
| Few-shot count | Often unstated | Declared (num_fewshot) |
| Reading the answer | Custom regex per script | Standard scoring per request type |
| Metric definition | Re-implemented from the paper | One canonical implementation |
| Reproducible by others | Rarely | By design, with a version pin |
When to reach for it (and when not to)
Good fits
- Comparing base models before you build on one — run the same standard tasks across candidates and read the gap.
- Checking a fine-tune for regressions on general capability after you train it on your own data.
- Reproducing a published number to confirm a release's leaderboard claim instead of trusting it.
- Reporting credible benchmark results for a model you are releasing, so reviewers can re-run them.
Poor fits
- Testing your product's behavior on your real prompts and documents — that is application-level evaluation, where you write your own eval suite.
- Judging open-ended quality like tone or helpfulness, where a fixed gold answer doesn't exist and an LLM-as-a-judge or human review fits better.
- Measuring latency, cost, or production reliability — the harness scores correctness, not operations.
Going deeper
Once the basics click, a few realities of benchmark-running are worth knowing — they explain a lot of confusing leaderboard arguments.
Versioning is not a footnote. Tasks in the harness are versioned, and a fix to a prompt or a dataset can shift scores between versions. A responsible result cites the harness version and the task version, not just "MMLU = X." Treat any score without that context as approximate.
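One way to keep that context attached to a number is to store the score and its recipe in a single record. The sketch below is an assumption about how you might do this yourself, not a format the harness defines: `result_record`, `comparable`, and the key names are hypothetical, and the score and version strings are placeholders.

```python
import json

RECIPE_KEYS = ("task", "harness_version", "task_version", "num_fewshot")


def result_record(task, score, *, harness_version, task_version, num_fewshot):
    # Bundle the score with everything needed to reproduce it.
    return {
        "task": task,
        "score": score,
        "harness_version": harness_version,
        "task_version": task_version,
        "num_fewshot": num_fewshot,
    }


def comparable(a: dict, b: dict) -> bool:
    # Two scores are comparable only if every recipe field matches.
    return all(a[key] == b[key] for key in RECIPE_KEYS)


record = result_record(
    "task_a",
    0.5,  # placeholder score, not a real result
    harness_version="<harness-version>",
    task_version="<task-version>",
    num_fewshot=5,
)
other = {**record, "num_fewshot": 0}

print(json.dumps(record, indent=2, sort_keys=True))
print(comparable(record, other))
```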
Loglikelihood scoring has limits. Picking the highest-probability multiple-choice option is robust, but it measures something narrower than "can the model actually solve this." A model might rank the right option highest yet fail to produce a clean answer when generating freely. For reasoning-heavy tasks, generation-based scoring (often with chain-of-thought) tells a fuller story — at the cost of being more sensitive to how you parse the output.
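The parsing sensitivity is easy to see in a small Python sketch. Both extractors below are hypothetical illustrations rather than the harness's own filters: one accepts only output that ends in an explicit "Answer: <number>" line, the other takes the last integer it can find.

```python
import re


def strict_extract(text: str):
    # Accept only output that ends with "Answer: <integer>".
    match = re.search(r"Answer:\s*(-?\d+)\s*$", text)
    return match.group(1) if match else None


def flexible_extract(text: str):
    # Fall back to the last integer anywhere in the text.
    numbers = re.findall(r"-?\d+", text)
    return numbers[-1] if numbers else None


gold = "12"
output = "There are 3 boxes with 4 apples each, so 3 * 4 = 12 apples."

strict_ok = strict_extract(output) == gold
flexible_ok = flexible_extract(output) == gold
print(strict_ok, flexible_ok)
```

The same model output is marked wrong by the strict rule and right by the flexible one, so a generation-based score should always state which extraction rule produced it.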
Contamination is the quiet killer. Public benchmarks leak into training data, so a high score can reflect memorization rather than skill — the model may have seen the test. The harness runs the questions faithfully; it cannot tell you whether the model studied the answer key in advance. That is why fresh, private, or held-out evaluations matter alongside public benchmarks.
Custom tasks are the real power. Because a task is just a config plus a dataset, you can add your own benchmark in the same framework and get the harness's consistent prompting and scoring for free. That is the bridge from running standard academic tests to building a reproducible, in-house evaluation that still benefits from a battle-tested runner.
Where to go next: learn how application-level LLM evals differ from academic benchmarks, how a golden dataset anchors your own tests, and how the line between code-graded and model-graded scoring shapes what you can measure. The durable lesson from the harness is simple: a benchmark number is meaningless without the recipe that produced it — so always ship the recipe with the score.
FAQ
What is lm-evaluation-harness used for?
It runs standardized academic benchmarks on language models in a consistent, reproducible way. You give it a model and a list of benchmark tasks, and it builds the prompts, queries the model, scores the answers, and reports a number — using the same prompt template, few-shot setup, and scoring rules every time so results are comparable across models.
Who makes lm-evaluation-harness?
It is an open-source project from EleutherAI, a non-profit AI research group. Because it is open and shared, it has become the de facto reference runner behind many public LLM leaderboards and model-release benchmark numbers.
Why do two MMLU scores for the same model sometimes differ?
Because a benchmark score depends on more than the questions: the prompt wording, the number of few-shot examples, how the answer is read from the model, and the metric all affect it. Two scores only match when those settings — and the harness/task version — match. A shared harness exists precisely to remove that ambiguity.
Is lm-evaluation-harness the same as a leaderboard?
No. The harness is the engine that computes scores; a leaderboard is a published ranking of those scores. Many leaderboards run models through the harness, but the harness itself just produces numbers — it does not host or rank them.
Can I use it to test my own LLM application?
It is built for fixed academic benchmarks, not your product's real prompts and data. You can add a custom task to evaluate your own dataset within the framework, but for product behavior most teams write an application-level eval suite instead. Use the harness for general capability and your own evals for app fitness.
What is few-shot evaluation in the harness?
Few-shot means prepending a fixed number of solved examples before the real question so the model learns the answer format from context. The count is part of each task's recipe, because the same model can score very differently at 0-shot versus 5-shot — pinning it is what makes runs comparable.