Overview
lmms-eval is an evaluation toolkit for large multimodal models (LMMs), meaning models that work with images, video, and audio, not just text. It bundles 100+ benchmark tasks and 30+ model backends behind a single command-line interface, so you can score a model on many datasets without wiring up each one yourself.
It is built for model teams and researchers who need eval numbers they can act on. The project focuses on three goals: reproducible runs that give the same numbers every time, efficient evaluation that keeps GPUs busy at scale, and trustworthy results that go beyond a single accuracy score by reporting confidence intervals and paired comparisons.
As a benchmark harness, it sits in the evaluation and testing stage of a model's lifecycle. You point it at a pretrained model and a list of tasks, and it handles dataset loading, generation, post-processing, and metric reporting in one pipeline.
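The four stages named above can be pictured with a toy pipeline. This is only an illustrative sketch: the `Sample` class, the stage functions, and `toy_model` are hypothetical names invented here, and none of it reflects lmms-eval's actual internals.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    question: str
    answer: str


def load_dataset() -> list[Sample]:
    # Stage 1: dataset loading (hard-coded toy data, not a real benchmark).
    return [Sample("2 + 2 = ?", "4"), Sample("Capital of France?", "Paris")]


def generate(model: Callable[[str], str], samples: list[Sample]) -> list[str]:
    # Stage 2: generation, one raw model output per sample.
    return [model(s.question) for s in samples]


def post_process(raw_outputs: list[str]) -> list[str]:
    # Stage 3: post-processing, normalizing whitespace, trailing periods, and case.
    return [r.strip().rstrip(".").lower() for r in raw_outputs]


def report(predictions: list[str], samples: list[Sample]) -> dict[str, float]:
    # Stage 4: metric reporting, here exact-match accuracy.
    correct = sum(p == s.answer.lower() for p, s in zip(predictions, samples))
    return {"accuracy": correct / len(samples)}


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real multimodal model backend.
    canned = {"2 + 2 = ?": " 4. ", "Capital of France?": "London"}
    return canned.get(prompt, "")


samples = load_dataset()
metrics = report(post_process(generate(toy_model, samples)), samples)
print(metrics)
```

A real harness adds media loading, batching, and per-task prompt formatting between these stages, but the data flow has the same shape.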
What it does
- 100+ built-in benchmark tasks spanning image, video, and audio evaluation
- 30+ model backends, including Qwen2.5-VL and other vision-language models
- Single unified CLI: choose a model and a task list, get consistent metrics
- Reproducible pipeline designed to return the same numbers on repeat runs
- Statistical reporting beyond raw accuracy: confidence intervals and paired comparisons
- Efficiency features like async serving, adaptive batching, and faster video I/O via TorchCodec
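To make "confidence intervals" concrete, here is a minimal sketch of one common approach, a percentile bootstrap over per-sample correctness. It is an assumption that this resembles what the toolkit computes; the source does not say which method lmms-eval uses, and the `scores` data below is made up for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy."""
    rng = random.Random(seed)  # fixed seed so repeat runs give the same interval
    n = len(scores)
    # Resample the per-sample scores with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lower, upper


# Hypothetical per-sample results: 1 = correct, 0 = wrong, 200 samples.
scores = [1] * 140 + [0] * 60
accuracy, lower, upper = bootstrap_ci(scores)
print(f"accuracy={accuracy:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

The width of the interval shows how much the headline accuracy could move if the benchmark had drawn a different set of samples, which a single number hides.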
Getting started
Clone the repo, install with uv, and run a small evaluation to confirm your environment works.
Install from source with uv
Clone the repository and install it (with all extras) using uv, the recommended package manager for a consistent environment.
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval && uv pip install -e ".[all]"
Run your first evaluation
Evaluate Qwen2.5-VL on the MME benchmark with a small sample limit. If it prints metrics, your setup is ready.
python -m lmms_eval \
--model qwen2_5_vl \
--model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
--tasks mme \
--batch_size 1 \
--limit 8
Explore available options
List the supported flags, models, and tasks to plan a fuller evaluation run.
uv run python -m lmms_eval --help
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
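If you want to script the first evaluation instead of typing it, the same command can be assembled in Python. The sketch below uses only the flags shown above; `build_eval_command` is a hypothetical helper, and joining several task names with commas is an assumption you should confirm against the `--help` output.

```python
import shlex
import sys


def build_eval_command(model, pretrained, tasks, batch_size=1, limit=None):
    """Build the argv list for `python -m lmms_eval` from the flags shown above."""
    cmd = [
        sys.executable, "-m", "lmms_eval",
        "--model", model,
        "--model_args", f"pretrained={pretrained}",
        # Assumption: multiple tasks are passed as one comma-separated value.
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]
    if limit is not None:
        # A small limit keeps the run short, as in the smoke test above.
        cmd += ["--limit", str(limit)]
    return cmd


cmd = build_eval_command(
    model="qwen2_5_vl",
    pretrained="Qwen/Qwen2.5-VL-3B-Instruct",
    tasks=["mme"],
    limit=8,
)
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems and makes it easy to loop over checkpoints or task lists.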
When to use it
- Benchmark a new multimodal model across many vision, video, and audio tasks with one command
- Reproduce paper results for a model so two teams report the same numbers
- Compare two model checkpoints with confidence intervals and paired tests instead of single-number accuracy
- Run a quick smoke test on a handful of samples to confirm a model and environment are wired up correctly
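To illustrate why a paired test differs from comparing two accuracy numbers, here is a small sketch of an exact sign test on the samples where two checkpoints disagree (the exact form of McNemar's test). This is a swapped-in textbook technique, not necessarily the test lmms-eval runs, and the two score lists are invented for the example.

```python
from math import comb


def paired_sign_test(scores_a, scores_b):
    """Exact two-sided sign test on samples where the two models disagree."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same samples")
    a_only = sum(1 for a, b in zip(scores_a, scores_b) if a > b)  # A right, B wrong
    b_only = sum(1 for a, b in zip(scores_a, scores_b) if b > a)  # B right, A wrong
    disagreements = a_only + b_only
    accuracy_diff = (a_only - b_only) / len(scores_a)
    if disagreements == 0:
        return accuracy_diff, 1.0
    k = min(a_only, b_only)
    # Chance of a split at least this lopsided if each model were equally
    # likely to win any given disagreement.
    tail = sum(comb(disagreements, i) for i in range(k + 1)) / 2 ** disagreements
    return accuracy_diff, min(1.0, 2 * tail)


# Hypothetical per-sample correctness for two checkpoints on the same 10 samples.
checkpoint_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
checkpoint_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
diff, p_value = paired_sign_test(checkpoint_a, checkpoint_b)
print(f"accuracy difference={diff:+.2f}  p-value={p_value:.3f}")
```

Because both checkpoints answer the same questions, only the disagreements carry information about which one is better; samples that both get right or both get wrong cancel out.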
How lmms-eval compares
lmms-eval alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | Reproducible evaluation suite for large multimodal models across image, video, and audio. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |