Overview
LightEval is an open-source toolkit from Hugging Face's Leaderboard and Evals team for evaluating large language models. It runs models on standard benchmarks whether the model is served by a remote endpoint or already loaded in memory, and it saves detailed sample-by-sample results so you can debug failures and compare models.
It is built for ML engineers and researchers who need to measure model quality on known tasks. The library ships with over 1000 evaluation tasks spanning knowledge, math, code, chat, and multilingual domains, including familiar benchmarks like MMLU, GSM8K, IFEval, and GPQA. You can also define your own custom tasks and metrics when the built-in ones don't fit.
As a benchmark harness, LightEval works across several inference backends from one interface: Hugging Face Accelerate for CPU or GPU, vLLM and SGLang for faster GPU serving, Nanotron for distributed settings, and hosted endpoints via TGI, LiteLLM, or inference providers.
What it does
- Over 1000 evaluation tasks across knowledge, math, code, chat, and multilingual domains
- Runs on multiple backends: Accelerate, vLLM, SGLang, Nanotron, and hosted endpoints
- Saves detailed, sample-by-sample results for debugging and model comparison
- CLI entry points under the lighteval command (eval, accelerate, vllm, sglang, endpoint) plus a Python API
- Custom task and custom metric definitions when built-in ones don't fit
- Optional push of results to the Hugging Face Hub
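
The CLI entry points listed above can also be launched from a script. The sketch below is a minimal illustration, assuming the lighteval eval invocation shown under Getting started (a model string followed by a task name). The helper names build_eval_command and run_eval are hypothetical and not part of LightEval.

```python
import shlex
import shutil
import subprocess
from typing import List, Optional


def build_eval_command(model: str, task: str) -> List[str]:
    """Return the argv list for `lighteval eval <model> <task>`."""
    return ["lighteval", "eval", model, task]


def run_eval(model: str, task: str) -> Optional[int]:
    """Run the evaluation if the lighteval executable is on PATH.

    Returns the process exit code, or None when lighteval is not installed.
    """
    if shutil.which("lighteval") is None:
        return None
    completed = subprocess.run(build_eval_command(model, task), check=False)
    return completed.returncode


# Model and task strings copied from the Getting started example.
command = build_eval_command("hf-inference-providers/openai/gpt-oss-20b", "gpqa:diamond")
print(shlex.join(command))
```

Only the command line is printed here; calling run_eval with the same arguments would start the actual evaluation, so it is left for the reader to invoke.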
Getting started
Install LightEval with pip, then run an evaluation from the command line or the Python API. Note: the project supports macOS and Linux; it is currently untested on Windows.
Install
Install the package from PyPI. Many optional extras are available for specific backends.

    pip install lighteval

Log in to the Hugging Face Hub (optional)
Only needed if you want to push results to the Hub. Add your access token via the CLI.

    hf auth login

Run an evaluation from the CLI
Evaluate a model on a benchmark using a remote inference service.

    lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond

Or use the Python API
Evaluate a model already loaded in memory by building a Pipeline.

    from transformers import AutoModelForCausalLM

    from lighteval.logging.evaluation_tracker import EvaluationTracker
    from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
    from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

    MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
    BENCHMARKS = "gsm8k"

    # Results are written under ./results
    evaluation_tracker = EvaluationTracker(output_dir="./results")
    pipeline_params = PipelineParameters(
        launcher_type=ParallelismManager.NONE,
        max_samples=2,  # cap the sample count for a quick smoke test
    )

    # Load the model with transformers, then wrap it for LightEval
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, device_map="auto"
    )
    config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
    model = TransformersModel.from_model(model, config)

    pipeline = Pipeline(
        model=model,
        pipeline_parameters=pipeline_params,
        evaluation_tracker=evaluation_tracker,
        tasks=BENCHMARKS,
    )

    results = pipeline.evaluate()
    pipeline.show_results()
    results = pipeline.get_results()

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
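
The dictionary returned by pipeline.get_results() can be flattened into rows for quick side-by-side reading. This is a minimal sketch, assuming the dictionary has a top-level "results" key that maps each task name to a dict of metric values. That shape, the flatten_scores helper, and the task and metric names are assumptions for illustration, and the numbers are placeholders rather than real scores.

```python
from typing import Any, Dict, List, Tuple


def flatten_scores(results: Dict[str, Any]) -> List[Tuple[str, str, float]]:
    """Return sorted (task, metric, value) rows from the assumed "results" section."""
    rows = []
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            # Keep numeric scores only; skip anything else stored alongside them.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                rows.append((task, metric, float(value)))
    return sorted(rows)


# Hand-written stand-in for pipeline.get_results(); values are placeholders.
example_results = {
    "results": {
        "gsm8k": {"exact_match": 0.5, "exact_match_stderr": 0.5},
    }
}

for task, metric, value in flatten_scores(example_results):
    print(f"{task:<10} {metric:<22} {value:.3f}")
```

Running the same function over the results of two models gives rows that can be compared directly, provided both runs used the same tasks.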
When to use it
- Benchmark a fine-tuned model on tasks like MMLU or GSM8K before release
- Compare several models on the same benchmark suite using one backend interface
- Inspect sample-by-sample outputs to debug where a model fails on a task
- Define a custom task or metric to evaluate a model on domain-specific data
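
The last two use cases, inspecting per-sample outputs and scoring with a custom metric, can be illustrated in plain Python. This is a sketch of the idea only: the Sample record, the exact_match metric, and the toy data are hypothetical, and it does not use LightEval's own task or metric registration API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One evaluated example (hypothetical record, not a LightEval class)."""
    prompt: str
    prediction: str
    reference: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(sample: Sample) -> float:
    """Score 1.0 when the normalized prediction equals the reference, else 0.0."""
    return 1.0 if normalize(sample.prediction) == normalize(sample.reference) else 0.0


def evaluate(samples: List[Sample]) -> Tuple[float, List[Sample]]:
    """Return the mean exact-match score and the samples that scored 0."""
    if not samples:
        return 0.0, []
    scores = [exact_match(s) for s in samples]
    failures = [s for s, score in zip(samples, scores) if score == 0.0]
    return sum(scores) / len(samples), failures


# Toy data for illustration only.
samples = [
    Sample("2 + 2 = ?", "4", "4"),
    Sample("Capital of France?", " paris ", "Paris"),
    Sample("3 * 5 = ?", "16", "15"),
]
accuracy, failures = evaluate(samples)
print(f"accuracy: {accuracy:.2f}")
for s in failures:
    print(f"FAILED {s.prompt!r}: got {s.prediction!r}, expected {s.reference!r}")
```

Keeping the failing samples next to the aggregate score is what makes sample-by-sample debugging practical: the average says how often the model is wrong, and the failure list shows where.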
How LightEval compares
LightEval alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | All-in-one toolkit for evaluating LLMs across multiple inference backends. |