AI/TLDR

LightEval

All-in-one toolkit for evaluating LLMs across multiple inference backends

Overview

LightEval is an open-source toolkit from Hugging Face's Leaderboard and Evals team for evaluating large language models. It runs models on standard benchmarks whether the model is served remotely or already loaded in memory, and it saves detailed sample-by-sample results so you can debug and compare model performance.

It is built for ML engineers and researchers who need to measure model quality on known tasks. The library ships with over 1000 evaluation tasks spanning knowledge, math, code, chat, and multilingual domains, including familiar benchmarks like MMLU, GSM8K, IFEval, and GPQA. You can also define your own custom tasks and metrics when the built-in ones don't fit.

As a benchmark harness, LightEval works across several inference backends from one interface: Hugging Face Accelerate for CPU or GPU, vLLM and SGLang for faster GPU serving, Nanotron for distributed settings, and hosted endpoints via TGI, LiteLLM, or inference providers.

What it does

  • Over 1000 evaluation tasks across knowledge, math, code, chat, and multilingual domains
  • Runs on multiple backends: Accelerate, vLLM, SGLang, Nanotron, and hosted endpoints
  • Saves detailed, sample-by-sample results for debugging and model comparison
  • CLI entry points (lighteval eval, accelerate, vllm, sglang, endpoint) plus a Python API
  • Custom task and custom metric definitions when built-in ones don't fit
  • Optional push of results to the Hugging Face Hub
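
To make the custom-metric bullet concrete, here is a minimal sketch of the kind of scoring logic such a metric might wrap. It is plain Python and deliberately does not use LightEval's own metric classes or registration API, which this page does not show. The names `normalize`, `exact_match`, and `corpus_score` are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction, references):
    """Return 1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def corpus_score(predictions, references_per_sample):
    """Average the per-sample scores; an empty corpus scores 0.0."""
    scores = [exact_match(p, refs) for p, refs in zip(predictions, references_per_sample)]
    return sum(scores) / len(scores) if scores else 0.0
```

A per-sample function plus a corpus-level aggregator is a common shape for evaluation metrics. How LightEval expects the two to be registered is something to confirm in the project's documentation.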

Getting started

Install LightEval with pip, then run an evaluation from the command line or the Python API. Note: the project supports Mac and Linux; it is currently untested on Windows.

Install

Install the package from PyPI. Many optional extras are available for specific backends.

```bash
pip install lighteval
```

Log in to the Hugging Face Hub (optional)

Only needed if you want to push results to the Hub. Add your access token via the CLI.

```bash
hf auth login
```

Run an evaluation from the CLI

Evaluate a model on a benchmark using a remote inference service.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond
```
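
If you want to repeat that command for several models or tasks, a small Python wrapper can generate the invocations. This is a sketch that assumes only the `lighteval eval MODEL TASK` form shown above, with one task per invocation. `build_eval_commands` and `run_all` are hypothetical helper names, not part of LightEval.

```python
import shlex
import subprocess
from itertools import product


def build_eval_commands(models, tasks):
    """Return one `lighteval eval MODEL TASK` argument list per (model, task) pair."""
    return [["lighteval", "eval", model, task] for model, task in product(models, tasks)]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True raises CalledProcessError if an evaluation exits non-zero.
            subprocess.run(command, check=True)


# Example, using the model and task from the command above:
# run_all(build_eval_commands(
#     ["hf-inference-providers/openai/gpt-oss-20b"], ["gpqa:diamond"]
# ))
```

Passing argument lists rather than a shell string avoids quoting problems with model identifiers, and the dry-run default lets you review the commands before spending compute.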

Or use the Python API

Evaluate a model already loaded in memory by building a Pipeline.

```python
from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "gsm8k"

# Save evaluation output under ./results.
evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2,  # cap the number of samples for a quick trial run
)

# Load the model with transformers, then wrap it for LightEval.
hf_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(hf_model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

# Run the evaluation, print a summary, then fetch the results object.
pipeline.evaluate()
pipeline.show_results()
results = pipeline.get_results()
```

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
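
Once a run has finished, you may want to look through whatever it saved. The sketch below is not taken from the project's documentation. It assumes the run leaves JSON files somewhere beneath the output directory and that the interesting scores sit under a top-level "results" key. Both are assumptions to verify against your own output. `load_result_files` and `summarize` are hypothetical helpers.

```python
import json
from pathlib import Path


def load_result_files(output_dir):
    """Return {path: parsed JSON} for every .json file beneath output_dir."""
    loaded = {}
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[str(path)] = json.load(handle)
    return loaded


def summarize(loaded, key="results"):
    """Keep only the payload stored under `key`, skipping files that lack it."""
    summary = {}
    for path, payload in loaded.items():
        if isinstance(payload, dict) and key in payload:
            summary[path] = payload[key]
    return summary


# Example:
# for path, metrics in summarize(load_result_files("./results")).items():
#     print(path, metrics)
```

Searching recursively avoids hard-coding a directory layout, and the `key` parameter lets you adapt the helper if your files use a different top-level name.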

When to use it

  • Benchmark a fine-tuned model on tasks like MMLU or GSM8K before release
  • Compare several models on the same benchmark suite using one backend interface
  • Inspect sample-by-sample outputs to debug where a model fails on a task
  • Define a custom task or metric to evaluate a model on domain-specific data
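
For the model-comparison use case, a small helper can line up two sets of scores. This sketch assumes each run's scores are available as a nested mapping of task name to metric name to number. That shape is an assumption made for illustration, not a documented LightEval format, and `metric_deltas` is a hypothetical name.

```python
def metric_deltas(baseline, candidate):
    """Return {(task, metric): candidate - baseline} for numeric entries both runs share."""
    deltas = {}
    for task, metrics in baseline.items():
        for metric, base_value in metrics.items():
            cand_value = candidate.get(task, {}).get(metric)
            both_numeric = isinstance(base_value, (int, float)) and isinstance(
                cand_value, (int, float)
            )
            if both_numeric:
                deltas[(task, metric)] = cand_value - base_value
    return deltas


def rank_changes(deltas):
    """Sort (task, metric) pairs from largest regression to largest improvement."""
    return sorted(deltas.items(), key=lambda item: item[1])
```

Entries missing from either run are skipped rather than treated as zero, so a task that one model never ran does not show up as a regression.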

How LightEval compares

LightEval alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | All-in-one toolkit for evaluating LLMs across multiple inference backends |