AI/TLDR

lmms-eval

Reproducible evaluation suite for large multimodal models across image, video, and audio

Overview

lmms-eval is an evaluation toolkit for large multimodal models (LMMs): models that work with images, video, and audio, not just text. It bundles 100+ benchmark tasks and 30+ model backends behind a single command-line interface, so you can score a model on many datasets without wiring up each one yourself.

It is built for model teams and researchers who need eval numbers they can act on. The project focuses on three goals: reproducible runs that give the same numbers every time, efficient evaluation that keeps GPUs busy at scale, and trustworthy results that go beyond a single accuracy score by reporting confidence intervals and paired comparisons.

As a benchmark harness, it sits in the evaluation and testing stage of a model's lifecycle. You point it at a pretrained model and a list of tasks, and it handles dataset loading, generation, post-processing, and metric reporting in one pipeline.
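The four stages named above can be pictured as a small loop. The following is a toy sketch in Python, not lmms-eval's internals: the `Sample` type, the `always_yes` stand-in model, and the exact-match metric are all hypothetical and exist only to show how the stages hand off to each other.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Sample:
    """One benchmark item: a prompt and its reference answer (hypothetical type)."""
    prompt: str
    reference: str


def load_dataset() -> list[Sample]:
    # Stage 1: dataset loading (a tiny in-memory set stands in for a real benchmark).
    return [
        Sample("Is the sky blue? Answer yes or no.", "yes"),
        Sample("Is fire cold? Answer yes or no.", "no"),
    ]


def postprocess(raw: str) -> str:
    # Stage 3: normalize the raw generation before scoring.
    return raw.strip().lower().rstrip(".")


def evaluate(model: Callable[[str], str], samples: Iterable[Sample]) -> dict[str, float]:
    scores = []
    for sample in samples:
        raw = model(sample.prompt)   # Stage 2: generation
        answer = postprocess(raw)    # Stage 3: post-processing
        scores.append(float(answer == sample.reference))
    # Stage 4: metric reporting (exact-match accuracy).
    return {"accuracy": sum(scores) / len(scores) if scores else 0.0}


def always_yes(prompt: str) -> str:
    # Hypothetical stand-in for a real model backend.
    return " Yes. "


print(evaluate(always_yes, load_dataset()))
```

A real harness swaps each stage for something heavier (dataset downloads, batched GPU generation, task-specific answer parsing), but the hand-off between stages has the same shape.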

What it does

  • 100+ built-in benchmark tasks spanning image, video, and audio evaluation
  • 30+ model backends, including Qwen2.5-VL and other vision-language models
  • Single unified CLI: choose a model and a task list, get consistent metrics
  • Reproducible pipeline designed to return the same numbers on repeat runs
  • Statistical reporting beyond raw accuracy: confidence intervals and paired comparisons
  • Efficiency features like async serving, adaptive batching, and faster video I/O via TorchCodec
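To make "confidence intervals" concrete, here is a minimal sketch of one common technique, the percentile bootstrap, applied to per-sample correctness scores. This page does not say which method lmms-eval uses, so treat the approach, the `bootstrap_ci` helper, and the made-up scores as assumptions for illustration only.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-sample scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample the scores with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lo, hi


# Hypothetical per-sample correctness (1 = correct, 0 = wrong), not real results.
scores = [1] * 70 + [0] * 30
mean, lo, hi = bootstrap_ci(scores)
print(f"accuracy = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

The width of the interval shrinks as the number of evaluated samples grows, which is why a score from a small `--limit` run should be read as a rough estimate.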

Getting started

Clone the repo, install with uv, and run a small evaluation to confirm your environment works.

Install from source with uv

Clone the repository and install it (with all extras) using uv, the recommended package manager for a consistent environment.

bash
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval && uv pip install -e ".[all]"

Run your first evaluation

Evaluate Qwen2.5-VL on the MME benchmark with a small sample limit. If it prints metrics, your setup is ready.

bash
python -m lmms_eval \
  --model qwen2_5_vl \
  --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
  --tasks mme \
  --batch_size 1 \
  --limit 8
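If you drive evaluations from a script, the command above can be assembled programmatically. The sketch below only builds and prints the argument list using the flags shown above; `build_eval_command` is a hypothetical helper, and joining several task names with commas is an assumption about how the task list is passed.

```python
import shlex
import sys


def build_eval_command(model, pretrained, tasks, batch_size=1, limit=None):
    """Assemble the lmms_eval invocation shown above as an argument list."""
    cmd = [
        sys.executable, "-m", "lmms_eval",
        "--model", model,
        "--model_args", f"pretrained={pretrained}",
        # Assumption: multiple tasks are passed as one comma-separated value.
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]
    if limit is not None:
        # Omit --limit entirely for a full run over the benchmark.
        cmd += ["--limit", str(limit)]
    return cmd


cmd = build_eval_command(
    model="qwen2_5_vl",
    pretrained="Qwen/Qwen2.5-VL-3B-Instruct",
    tasks=["mme"],
    limit=8,
)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the run; building it as a list rather than a string avoids shell-quoting problems in model paths.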

Explore available options

List the supported flags, models, and tasks to plan a fuller evaluation run.

bash
uv run python -m lmms_eval --help

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

  • Benchmark a new multimodal model across many vision, video, and audio tasks with one command
  • Reproduce paper results for a model so two teams report the same numbers
  • Compare two model checkpoints with confidence intervals and paired tests instead of single-number accuracy
  • Run a quick smoke test on a handful of samples to confirm a model and environment are wired up correctly
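For the checkpoint-comparison case, the key idea behind a paired comparison is that both models are scored on the same samples, so the per-sample differences can be analyzed directly. Below is a minimal sketch using a paired bootstrap; it is an illustrative stand-in rather than lmms-eval's own procedure, and `paired_bootstrap_diff` and the scores are hypothetical.

```python
import random
import statistics


def paired_bootstrap_diff(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """CI for mean(B) - mean(A), resampling per-sample differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs scores on the same samples")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Pairing: each difference comes from one sample seen by both models.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boot = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = boot[int((alpha / 2) * n_resamples)]
    hi = boot[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Hypothetical per-sample correctness for two checkpoints on the same 100 samples.
checkpoint_a = [1] * 60 + [0] * 40
checkpoint_b = [1] * 75 + [0] * 25
diff, lo, hi = paired_bootstrap_diff(checkpoint_a, checkpoint_b)
verdict = "interval excludes 0" if lo > 0 or hi < 0 else "interval includes 0"
print(f"B - A = {diff:+.3f}, 95% CI = [{lo:+.3f}, {hi:+.3f}] ({verdict})")
```

Because samples that both checkpoints get right (or both get wrong) contribute a difference of zero, a paired interval is typically tighter than comparing two independent accuracy intervals.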

How lmms-eval compares

lmms-eval alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | Reproducible evaluation suite for large multimodal models across image, video, and audio |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |