lmms-eval

Reproducible evaluation suite for large multimodal models across image, video, and audio

github.com/EvolvingLMMs-Lab/lmms-eval ★ 4.2k lmms-lab.github.io

Overview

lmms-eval is an evaluation toolkit for large multimodal models (LMMs) - models that work with images, video, and audio, not just text. It bundles 100+ benchmark tasks and 30+ model backends behind a single command-line interface, so you can score a model on many datasets without wiring up each one yourself.

It is built for model teams and researchers who need eval numbers they can act on. The project focuses on three goals: reproducible runs that give the same numbers every time, efficient evaluation that keeps GPUs busy at scale, and trustworthy results that go beyond a single accuracy score by reporting confidence intervals and paired comparisons.

As a benchmark harness, it sits in the evaluation and testing stage of a model's lifecycle. You point it at a pretrained model and a list of tasks, and it handles dataset loading, generation, post-processing, and metric reporting in one pipeline.
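
The four stages named above can be pictured with a toy harness. This is a minimal sketch and not lmms-eval's internals: the in-memory dataset, `stub_model`, `postprocess`, and `evaluate` are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical toy dataset: each sample has a question and a reference answer.
SAMPLES: List[Dict[str, str]] = [
    {"question": "Is there a cat in the image?", "answer": "yes"},
    {"question": "Is the sky green?", "answer": "no"},
    {"question": "Is there text in the image?", "answer": "yes"},
]


def load_dataset() -> List[Dict[str, str]]:
    """Stage 1: dataset loading (here, just an in-memory list)."""
    return SAMPLES


def stub_model(question: str) -> str:
    """Stage 2: generation. A stand-in model that always answers ' Yes.'"""
    return " Yes."


def postprocess(raw: str) -> str:
    """Stage 3: normalize raw output so it can be compared to the reference."""
    return raw.strip().rstrip(".").lower()


def evaluate(model: Callable[[str], str]) -> Dict[str, float]:
    """Stage 4: score every sample and report an aggregate metric."""
    samples = load_dataset()
    correct = [postprocess(model(s["question"])) == s["answer"] for s in samples]
    return {"accuracy": sum(correct) / len(correct), "n": float(len(correct))}


result = evaluate(stub_model)
print(result)
```

A real harness swaps each stage for something heavier (dataset downloads, batched multimodal generation, task-specific answer parsing), but the flow of data from samples to a metric is the same.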

What it does

100+ built-in benchmark tasks spanning image, video, and audio evaluation
30+ model backends, including Qwen2.5-VL and other vision-language models
Single unified CLI: choose a model and a task list, get consistent metrics
Reproducible pipeline designed to return the same numbers on repeat runs
Statistical reporting beyond raw accuracy: confidence intervals and paired comparisons
Efficiency features like async serving, adaptive batching, and faster video I/O via TorchCodec
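
To make the confidence-interval idea concrete, here is a minimal sketch of a percentile bootstrap over per-sample correctness scores. It assumes you already have a list of 0/1 scores for one model on one task; the function name `bootstrap_ci` and the example data are hypothetical, and this is not necessarily the method lmms-eval itself uses.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    scores: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so repeat runs give the same interval
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the resampled accuracy.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, lower, upper


# Made-up example: 70 correct answers out of 100 samples.
scores = [1] * 70 + [0] * 30
acc, lo, hi = bootstrap_ci(scores)
print(f"accuracy={acc:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

The interval width shrinks as the number of samples grows, which is why a score from a small `--limit` run should be read as a smoke test rather than a result.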

Getting started

Clone the repo, install with uv, and run a small evaluation to confirm your environment works.

Install from source with uv

Clone the repository and install it (with all extras) using uv, the recommended package manager for a consistent environment.

bash

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval && uv pip install -e ".[all]"

Run your first evaluation

Evaluate Qwen2.5-VL on the MME benchmark with a small sample limit. If it prints metrics, your setup is ready.

bash

python -m lmms_eval \
  --model qwen2_5_vl \
  --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
  --tasks mme \
  --batch_size 1 \
  --limit 8
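
If you want to script the same invocation, for example to loop over several tasks, the command can be assembled as an argument list in Python. This is a sketch that uses only the flags shown above; `build_command` is a hypothetical helper, not part of lmms-eval, and it only builds the command rather than running it.

```python
import shlex
import sys
from typing import List, Optional


def build_command(
    task: str,
    model: str = "qwen2_5_vl",
    pretrained: str = "Qwen/Qwen2.5-VL-3B-Instruct",
    batch_size: int = 1,
    limit: Optional[int] = 8,
) -> List[str]:
    """Assemble the argv list for one lmms_eval run on a single task."""
    cmd = [
        sys.executable, "-m", "lmms_eval",
        "--model", model,
        "--model_args", f"pretrained={pretrained}",
        "--tasks", task,
        "--batch_size", str(batch_size),
    ]
    if limit is not None:
        # Omit --limit entirely for a full run over the task.
        cmd += ["--limit", str(limit)]
    return cmd


cmd = build_command("mme")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems in values such as the `pretrained=...` string.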

Explore available options

List the supported flags, models, and tasks to plan a fuller evaluation run.

bash

uv run python -m lmms_eval --help

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Benchmark a new multimodal model across many vision, video, and audio tasks with one command
Reproduce paper results for a model so two teams report the same numbers
Compare two model checkpoints with confidence intervals and paired tests instead of single-number accuracy
Run a quick smoke test on a handful of samples to confirm a model and environment are wired up correctly
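
For the checkpoint-comparison case, a paired test looks only at the samples where the two models disagree. The sketch below is an exact two-sided sign test (the exact form of McNemar's test) on per-sample 0/1 scores. It assumes both checkpoints were scored on the same samples in the same order; `paired_sign_test` and the example scores are hypothetical, and lmms-eval's own paired comparison may work differently.

```python
from math import comb
from typing import List, Tuple


def paired_sign_test(a: List[int], b: List[int]) -> Tuple[int, int, float]:
    """Return (a_only, b_only, p_value) for two aligned lists of 0/1 scores."""
    if len(a) != len(b):
        raise ValueError("score lists must cover the same samples")
    # Count discordant pairs: samples only one of the two checkpoints got right.
    a_only = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    b_only = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the checkpoints never disagree
    # Under the null hypothesis each discordant pair is a fair coin flip.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Made-up scores for two checkpoints on the same ten samples.
ckpt_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
ckpt_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
a_only, b_only, p = paired_sign_test(ckpt_a, ckpt_b)
print(f"A-only wins={a_only}, B-only wins={b_only}, p={p:.3f}")
```

Here checkpoint A scores 80% against B's 50%, yet with only three discordant samples the test cannot rule out chance, which is the kind of nuance a single accuracy number hides.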

How lmms-eval compares

lmms-eval alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LM Evaluation Harness	★ 13k	EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass	★ 7.1k	An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench	★ 5.2k	A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals	★ 4.5k	OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval	★ 4.2k	Reproducible evaluation suite for large multimodal models across image, video, and audio
AgentBench	★ 3.5k	A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM	★ 2.8k	Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
LightEval	★ 2.5k	Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.
