AI/TLDR

LM Evaluation Harness

Run language models against 60+ academic benchmarks with one command

Overview

LM Evaluation Harness is EleutherAI's framework for testing generative language models on a large set of standard tasks. It ships with over 60 academic benchmarks (plus hundreds of subtasks and variants) and gives every model a consistent, tokenization-agnostic interface, so results stay comparable across papers and runs.

It is built for researchers and ML engineers who need to score a model on known datasets like HellaSwag or the Open LLM Leaderboard tasks. You point it at a model backend, name the tasks you want, and it reports the metrics. The same harness is the backend behind Hugging Face's Open LLM Leaderboard and is used internally by organizations including NVIDIA, Cohere, and Mosaic ML.

As a benchmark harness, it sits between your model and the datasets: it loads the model through one of several backends (HuggingFace transformers, vLLM, SGLang, or hosted APIs), formats the prompts, runs few-shot evaluation, and aggregates the scores.

What it does

  • Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented
  • Multiple model backends installed as extras: HuggingFace transformers, vLLM, SGLang, and commercial APIs such as OpenAI
  • Refactored CLI with run, ls, and validate subcommands plus YAML config files via --config
  • Config-based, Jinja2-driven task creation so you can define and share custom tasks and prompts
  • Support for evaluating PEFT adapters (e.g. LoRA) and quantized models
  • Publicly available prompts for reproducible, comparable results across papers
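
As a rough illustration of the adapter support listed above: the harness's hf backend accepts a peft= entry in --model_args that names an adapter to load on top of the base model. The sketch below only assembles and prints such a command. BASE_MODEL and ADAPTER are hypothetical placeholders, not real repositories, and the run subcommand is assumed to accept the same --model_args syntax shown under Getting started.

```shell
# Hypothetical placeholders: substitute a real base model and LoRA adapter.
BASE_MODEL="your-org/base-model"
ADAPTER="your-org/lora-adapter"

# peft=<adapter> asks the hf backend to load the adapter on top of the base model.
MODEL_ARGS="pretrained=$BASE_MODEL,peft=$ADAPTER"
CMD="lm-eval run --model hf --model_args $MODEL_ARGS --tasks hellaswag"

# Dry run: print the command for review instead of executing it.
echo "$CMD"
```

Printing first makes it easy to check the comma-separated model arguments, which must not contain spaces, before starting a long evaluation.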

Getting started

Install the package and the model backend you need, then run an evaluation from the command line. The base install no longer pulls in transformers or torch, so pick a backend extra.

Install from GitHub

Clone the repository and install it in editable mode.

```bash
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```

Add a model backend

Model backends are separate optional extras. Install the HuggingFace transformers backend (or vllm / api as needed).

```bash
pip install "lm_eval[hf]"
```

List available tasks

Use the ls subcommand to see the benchmarks you can run.

```bash
lm-eval ls tasks
```

Run an evaluation

Evaluate a HuggingFace model on a task with the run subcommand. Use lm-eval run -h to see all options.

```bash
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag
```
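
A fuller invocation usually sets the few-shot count, batch size, and an output location. The sketch below assembles one and prints it rather than running it. The flags --num_fewshot, --batch_size, --limit, and --output_path come from the harness's documented CLI and are assumed here to carry over unchanged to the run subcommand, so confirm them with lm-eval run -h. All values are illustrative.

```shell
# Assumed flags: confirm each with `lm-eval run -h` before relying on them.
MODEL_ARGS="pretrained=gpt2"   # same model as the basic example
TASKS="hellaswag"
CMD="lm-eval run --model hf --model_args $MODEL_ARGS --tasks $TASKS --num_fewshot 5 --batch_size 8 --limit 100 --output_path results"

# Dry run: print the command so it can be reviewed first.
echo "$CMD"
# To execute it, paste the printed line into the shell.
```

--limit caps the number of examples per task, which is handy for a quick smoke test; drop it for any score you intend to report.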

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Score a new or fine-tuned model on standard benchmarks like HellaSwag before publishing results
  • Reproduce or compare numbers from a paper using the same public prompts and tasks
  • Run the Open LLM Leaderboard task group locally against your own model
  • Evaluate a hosted or API-served model (e.g. via vLLM's OpenAI-compatible endpoint) without changing your tooling
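
For the API-served case in the last bullet, here is a minimal sketch under several assumptions: a server is already exposing an OpenAI-compatible completions endpoint at the hypothetical URL below, the api extra is installed, and the harness's local-completions model type takes the model=, base_url=, num_concurrent=, max_retries=, and tokenized_requests= arguments as described in its documentation. SERVED_MODEL is a placeholder, and gsm8k is picked only as an example of a generative task. The command is printed, not executed.

```shell
# Hypothetical values: adjust the served model name and endpoint URL.
SERVED_MODEL="your-org/served-model"
BASE_URL="http://localhost:8000/v1/completions"

# Arguments for the local-completions model type (assumed; check the harness docs).
MODEL_ARGS="model=$SERVED_MODEL,base_url=$BASE_URL,num_concurrent=1,max_retries=3,tokenized_requests=False"
CMD="lm-eval run --model local-completions --model_args $MODEL_ARGS --tasks gsm8k"

# Dry run: print the command for review instead of executing it.
echo "$CMD"
```

Only the --model and --model_args values differ from a local run, which is what lets the rest of the tooling stay the same.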

How LM Evaluation Harness compares

LM Evaluation Harness alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| LM Evaluation Harness | ★ 13k | Run language models against 60+ academic benchmarks with one command |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |