LM Evaluation Harness

Run language models against 60+ academic benchmarks with one command

GitHub: github.com/EleutherAI/lm-evaluation-harness (★ 13k)
Project page: eleuther.ai/projects/large-language-model-evaluation

Overview

LM Evaluation Harness is EleutherAI's framework for testing generative language models on a large set of standard tasks. It ships with over 60 academic benchmarks (plus hundreds of subtasks and variants) and gives every model a consistent, tokenization-agnostic interface, so results stay comparable across papers and runs.

It is built for researchers and ML engineers who need to score a model on known datasets like HellaSwag or the Open LLM Leaderboard tasks. You point it at a model backend, name the tasks you want, and it reports the metrics. The same harness is the backend behind Hugging Face's Open LLM Leaderboard and is used internally by organizations including NVIDIA, Cohere, and Mosaic ML.

As a benchmark harness, it sits between your model and the datasets: it loads the model through one of several backends (HuggingFace transformers, vLLM, SGLang, or hosted APIs), formats the prompts, runs few-shot evaluation, and aggregates the scores.
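The flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the harness's own code: Doc, ToyModel, build_prompt, and evaluate are hypothetical names, and the word-overlap score only stands in for the log-likelihoods a real backend would return.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    question: str
    choices: list[str]
    answer: int  # index of the correct choice


class ToyModel:
    """Hypothetical stand-in for a model backend; scores by word overlap."""

    def loglikelihood(self, context: str, continuation: str) -> float:
        context_words = set(context.lower().split())
        return float(sum(w in context_words for w in continuation.lower().split()))


def build_prompt(fewshot: list[Doc], doc: Doc) -> str:
    """Format the few-shot examples, then the question under test."""
    shots = [f"Q: {d.question}\nA: {d.choices[d.answer]}" for d in fewshot]
    return "\n\n".join(shots + [f"Q: {doc.question}\nA:"])


def evaluate(model: ToyModel, docs: list[Doc], fewshot: list[Doc]) -> dict[str, float]:
    """Score every choice, pick the highest-scoring one, aggregate accuracy."""
    correct = 0
    for doc in docs:
        prompt = build_prompt(fewshot, doc)
        scores = [model.loglikelihood(prompt, " " + c) for c in doc.choices]
        prediction = scores.index(max(scores))  # ties go to the first choice
        correct += int(prediction == doc.answer)
    return {"acc": correct / len(docs)}


fewshot = [Doc("Which animal barks?", ["dog", "cat"], 0)]
docs = [
    Doc("A dog is what kind of thing?", ["animal", "mineral"], 0),
    Doc("What is two plus two?", ["five", "four"], 1),
]
print(evaluate(ToyModel(), docs, fewshot))
```

The point of the sketch is the separation of concerns: the model only answers scoring requests, while prompt formatting and metric aggregation live in the harness, which is what keeps numbers comparable across backends.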

What it does

Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented
Multiple model backends installed as extras: HuggingFace transformers, vLLM, SGLang, and commercial APIs such as OpenAI
Refactored CLI with run, ls, and validate subcommands plus YAML config files via --config
Config-based, Jinja2-driven task creation so you can define and share custom tasks and prompts
Support for evaluating PEFT adapters (e.g. LoRA) and quantized models
Publicly available prompts for reproducible, comparable results across papers
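To make the config-based, template-driven task idea concrete, here is a rough sketch of what a task definition pairs together. It is written as a Python dict for illustration only; in the harness a task lives in a YAML file. The task name and dataset path are hypothetical, the field names follow the project's task-config conventions as best understood and should be checked against the official task guide, and render is a minimal stand-in for Jinja2 that handles only plain {{field}} placeholders.

```python
import re

# Hypothetical task definition, shown as a dict for illustration.
# Verify field names against the official task guide before relying on them.
task_config = {
    "task": "my_toy_qa",                      # hypothetical task name
    "dataset_path": "my_org/my_toy_dataset",  # hypothetical dataset
    "output_type": "generate_until",
    "doc_to_text": "Question: {{question}}\nAnswer:",
    "doc_to_target": "{{answer}}",
}


def render(template: str, doc: dict) -> str:
    """Minimal stand-in for Jinja2: fills {{field}} placeholders only."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(doc[m.group(1)]), template)


doc = {"question": "What is the capital of France?", "answer": "Paris"}
prompt = render(task_config["doc_to_text"], doc)
target = render(task_config["doc_to_target"], doc)
print(prompt)
print(target)
```

Keeping the prompt in a template rather than in code is what lets a task be shared as a single file and reproduced with the exact same wording.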

Getting started

Install the package and the model backend you need, then run an evaluation from the command line. The base install no longer pulls in transformers or torch, so pick a backend extra.

Install from GitHub

Clone the repository and install it in editable mode.

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Add a model backend

Model backends are separate optional extras. Install the HuggingFace transformers backend (or vllm / api as needed).

pip install "lm_eval[hf]"
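Because the base install does not include a backend, it can help to confirm that the backend's modules are importable before starting a long run. The check below is a small sketch: the mapping from extras to module names is an assumption made for illustration, not taken from the project's packaging metadata, and missing_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping from extras to the import names they provide;
# check the project's packaging metadata for the authoritative list.
BACKEND_MODULES = {
    "hf": ["transformers", "torch"],
    "vllm": ["vllm"],
}


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


for extra, modules in BACKEND_MODULES.items():
    missing = missing_modules(modules)
    status = "ready" if not missing else "missing " + ", ".join(missing)
    print(f"{extra}: {status}")
```

find_spec only looks the module up without importing it, so the check stays fast even for heavy packages.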

List available tasks

Use the ls subcommand to see the benchmarks you can run.

lm-eval ls tasks

Run an evaluation

Evaluate a HuggingFace model on a task with the run subcommand. Use lm-eval run -h to see all options.

lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag
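When the same evaluation has to be repeated over several checkpoints, the command above can be assembled programmatically. The following is a sketch that uses only the flags shown in this section; build_run_command is a hypothetical helper, the checkpoint list is illustrative, and it assumes that model arguments and task names are passed as comma-separated lists.

```python
import shlex


def build_run_command(model: str, model_args: dict[str, str], tasks: list[str]) -> list[str]:
    """Assemble an `lm-eval run` argv list from the flags used above."""
    args = ",".join(f"{key}={value}" for key, value in model_args.items())
    return [
        "lm-eval", "run",
        "--model", model,
        "--model_args", args,
        "--tasks", ",".join(tasks),
    ]


# Hypothetical sweep: the same task over two checkpoints.
for checkpoint in ["gpt2", "gpt2-medium"]:
    command = build_run_command("hf", {"pretrained": checkpoint}, ["hellaswag"])
    print(shlex.join(command))
    # To execute for real: subprocess.run(command, check=True)
```

Building an argv list rather than a single string avoids shell-quoting surprises if a model argument ever contains spaces.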

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Score a new or fine-tuned model on standard benchmarks like HellaSwag before publishing results
Reproduce or compare numbers from a paper using the same public prompts and tasks
Run the Open LLM Leaderboard task group locally against your own model
Evaluate a hosted or API-served model (e.g. via vLLM's OpenAI-compatible endpoint) without changing your tooling

How LM Evaluation Harness compares

LM Evaluation Harness alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LM Evaluation Harness	★ 13k	Run language models against 60+ academic benchmarks with one command
OpenCompass	★ 7.1k	An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench	★ 5.2k	A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals	★ 4.5k	OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval	★ 4.2k	An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench	★ 3.5k	A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM	★ 2.8k	Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
LightEval	★ 2.5k	Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.
