Overview
LM Evaluation Harness is EleutherAI's framework for testing generative language models on a large set of standard tasks. It ships with over 60 academic benchmarks (plus hundreds of subtasks and variants) and gives every model a consistent, tokenization-agnostic interface, so results stay comparable across papers and runs.
It is built for researchers and ML engineers who need to score a model on known datasets like HellaSwag or the Open LLM Leaderboard tasks. You point it at a model backend, name the tasks you want, and it reports the metrics. The same harness is the backend behind Hugging Face's Open LLM Leaderboard and is used internally by organizations including NVIDIA, Cohere, and Mosaic ML.
As a benchmark harness, it sits between your model and the datasets: it loads the model through one of several backends (HuggingFace transformers, vLLM, SGLang, or hosted APIs), formats the prompts, runs few-shot evaluation, and aggregates the scores.
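To make that flow concrete, here is a minimal toy sketch of the format, run, and aggregate steps for a multiple-choice task. It is not the harness's own code: `Example`, `format_prompt`, `evaluate`, and `toy_score` are hypothetical names, and `toy_score` stands in for a real model backend that would return a log-likelihood for each candidate answer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    question: str
    choices: Sequence[str]
    answer: int  # index of the correct choice


def format_prompt(fewshot: Sequence[Example], target: Example) -> str:
    """Prepend solved few-shot examples to the unanswered target question."""
    shots = [f"Q: {ex.question}\nA: {ex.choices[ex.answer]}" for ex in fewshot]
    return "\n\n".join(shots + [f"Q: {target.question}\nA:"])


def evaluate(
    score: Callable[[str, str], float],
    fewshot: Sequence[Example],
    test_set: Sequence[Example],
) -> float:
    """Pick the highest-scoring choice for each example and return accuracy."""
    correct = 0
    for ex in test_set:
        prompt = format_prompt(fewshot, ex)
        scores = [score(prompt, " " + choice) for choice in ex.choices]
        prediction = scores.index(max(scores))
        correct += int(prediction == ex.answer)
    return correct / len(test_set)


def toy_score(prompt: str, continuation: str) -> float:
    """Hypothetical stand-in for a model: prefers the longest continuation."""
    return float(len(continuation))


fewshot = [Example("2 + 2?", ["4", "5"], 0)]
test_set = [
    Example("Capital of France?", ["Paris", "Rome"], 0),
    Example("Opposite of hot?", ["cold", "lukewarm"], 0),
]
accuracy = evaluate(toy_score, fewshot, test_set)
print(f"accuracy = {accuracy:.2f}")
```

Swapping `toy_score` for a function that queries a real model is the only change the loop would need, which is roughly the role a backend plays in this kind of design.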
What it does
- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented
- Multiple model backends installed as extras: HuggingFace transformers, vLLM, SGLang, and commercial APIs such as OpenAI
- Refactored CLI with run, ls, and validate subcommands plus YAML config files via --config
- Config-based, Jinja2-driven task creation so you can define and share custom tasks and prompts
- Support for evaluating PEFT adapters (e.g. LoRA) and quantized models
- Publicly available prompts for reproducible, comparable results across papers
Getting started
Install the package and the model backend you need, then run an evaluation from the command line. The base install no longer pulls in transformers or torch, so pick a backend extra.
Install from GitHub
Clone the repository and install it in editable mode.
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
Add a model backend
Model backends are separate optional extras. Install the HuggingFace transformers backend (or vllm / api as needed).
pip install "lm_eval[hf]"
List available tasks
Use the ls subcommand to see the benchmarks you can run.
lm-eval ls tasks
Run an evaluation
Evaluate a HuggingFace model on a task with the run subcommand. Use lm-eval run -h to see all options.
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
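If you launch evaluations from a script, the run command shown above can be assembled programmatically. The sketch below uses only the flags that appear in that command (`--model`, `--model_args`, `--tasks`). The helper name `build_run_command` is hypothetical, and it assumes that multiple tasks and multiple model arguments are passed as comma-separated lists.

```python
import shlex
from typing import List, Mapping, Sequence


def build_run_command(
    model: str,
    model_args: Mapping[str, str],
    tasks: Sequence[str],
) -> List[str]:
    """Assemble an argv list for `lm-eval run` (hypothetical helper)."""
    # Assumption: model arguments are comma-separated key=value pairs
    joined_args = ",".join(f"{key}={value}" for key, value in model_args.items())
    # Assumption: several tasks are given as one comma-separated string
    joined_tasks = ",".join(tasks)
    return [
        "lm-eval", "run",
        "--model", model,
        "--model_args", joined_args,
        "--tasks", joined_tasks,
    ]


cmd = build_run_command("hf", {"pretrained": "gpt2"}, ["hellaswag"])
command_line = shlex.join(cmd)
print(command_line)
```

To actually run it, pass `cmd` to `subprocess.run(cmd, check=True)` in an environment where `lm-eval` and a model backend are installed.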
When to use it
- Score a new or fine-tuned model on standard benchmarks like HellaSwag before publishing results
- Reproduce or compare numbers from a paper using the same public prompts and tasks
- Run the Open LLM Leaderboard task group locally against your own model
- Evaluate a hosted or API-served model (e.g. via vLLM's OpenAI-compatible endpoint) without changing your tooling
How LM Evaluation Harness compares
LM Evaluation Harness alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | Runs language models against 60+ academic benchmarks with one command. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |