LightEval

All-in-one toolkit for evaluating LLMs across multiple inference backends

Overview

LightEval is an open-source toolkit from Hugging Face's Leaderboard and Evals team for evaluating large language models. It runs models on standard benchmarks whether the model is served remotely or already loaded in memory, and it saves detailed sample-by-sample results so you can debug and compare how models perform.

It is built for ML engineers and researchers who need to measure model quality on known tasks. The library ships with over 1000 evaluation tasks spanning knowledge, math, code, chat, and multilingual domains, including familiar benchmarks like MMLU, GSM8K, IFEval, and GPQA. You can also define your own custom tasks and metrics when the built-in ones don't fit.

As a benchmark harness, LightEval works across several inference backends from one interface: Hugging Face Accelerate for CPU or GPU, vLLM and SGLang for faster GPU serving, Nanotron for distributed settings, and hosted endpoints via TGI, LiteLLM, or inference providers.

What it does

Over 1000 evaluation tasks across knowledge, math, code, chat, and multilingual domains
Runs on multiple backends: Accelerate, vLLM, SGLang, Nanotron, and hosted endpoints
Saves detailed, sample-by-sample results for debugging and model comparison
CLI subcommands (lighteval eval, lighteval accelerate, lighteval vllm, lighteval sglang, lighteval endpoint) plus a Python API
Custom task and custom metric definitions when built-in ones don't fit
Optional push of results to the Hugging Face Hub
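
To make the custom-metric idea concrete, here is a minimal sketch of a sample-level metric and its corpus-level aggregation in plain Python. It does not use LightEval's registration API. The names normalize, exact_match, and aggregate are hypothetical and chosen only for illustration, so check the official documentation for how metrics are actually declared.

```python
import re
from statistics import mean


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match(prediction: str, golds: list[str]) -> float:
    """Sample-level score: 1.0 if the prediction equals any gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(gold) for gold in golds))


def aggregate(scores: list[float]) -> float:
    """Corpus-level score: mean of the sample-level scores (0.0 if empty)."""
    return mean(scores) if scores else 0.0


# Hypothetical usage on two samples.
scores = [exact_match("  Paris ", ["paris"]), exact_match("Lyon", ["Paris"])]
print(aggregate(scores))
```

Keeping the per-sample score separate from the aggregation is what makes sample-by-sample debugging possible: each score can be stored next to the sample that produced it.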

Getting started

Install LightEval with pip, then run an evaluation from the command line or the Python API. Note: the project supports Mac and Linux; it is currently untested on Windows.

Install

Install the package from PyPI. Many optional extras are available for specific backends.

bash

pip install lighteval

Log in to the Hugging Face Hub (optional)

Only needed if you want to push results to the Hub. Add your access token via the CLI.

bash

hf auth login

Run an evaluation from the CLI

Evaluate a model on a benchmark using a remote inference service.

bash

lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond
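
If you want to launch the same command from a script, for example to loop over several models, a small wrapper is one option. The sketch below only assembles the argument list in the form shown above (lighteval eval, then the model, then the task). The helper name build_eval_command is hypothetical, and the actual subprocess call is left commented out so the snippet runs without LightEval installed.

```python
import shlex


def build_eval_command(model: str, task: str) -> list[str]:
    """Return the argv list for `lighteval eval <model> <task>`."""
    return ["lighteval", "eval", model, task]


cmd = build_eval_command(
    "hf-inference-providers/openai/gpt-oss-20b", "gpqa:diamond"
)
print(shlex.join(cmd))

# To actually run the evaluation (requires lighteval on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when model names contain unusual characters.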

Or use the Python API

Evaluate a model already loaded in memory by building a Pipeline.

python

from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "gsm8k"

evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto"
)
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

# Run the evaluation, print a summary, then fetch the results.
pipeline.evaluate()
pipeline.show_results()
results = pipeline.get_results()
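
Once you have the results object, you will usually want it in a flat form for logging or comparison. The sketch below assumes the results are a dict with a "results" key that maps task names to metric/value pairs. That shape is an assumption, not something confirmed here, so inspect the real object before relying on it. The helper name flatten_results is hypothetical and the numbers are illustrative placeholders, not real benchmark scores.

```python
def flatten_results(results: dict) -> list[tuple[str, str, float]]:
    """Turn {"results": {task: {metric: value}}} into sorted (task, metric, value) rows."""
    rows = []
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            # Skip non-numeric entries such as labels or nested objects.
            if isinstance(value, (int, float)):
                rows.append((task, metric, float(value)))
    return sorted(rows)


# Placeholder data in the ASSUMED shape (values are made up for illustration).
example = {
    "results": {
        "gsm8k": {"em": 0.5, "em_stderr": 0.25, "note": "placeholder"},
    }
}

for task, metric, value in flatten_results(example):
    print(f"{task}\t{metric}\t{value:.3f}")
```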

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Benchmark a fine-tuned model on tasks like MMLU or GSM8K before release
Compare several models on the same benchmark suite using one backend interface
Inspect sample-by-sample outputs to debug where a model fails on a task
Define a custom task or metric to evaluate a model on domain-specific data

How LightEval compares

LightEval alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LM Evaluation Harness	★ 13k	EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass	★ 7.1k	An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench	★ 5.2k	A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals	★ 4.5k	OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval	★ 4.2k	An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench	★ 3.5k	A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM	★ 2.8k	Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
LightEval	★ 2.5k	All-in-one toolkit for evaluating LLMs across multiple inference backends
