OpenCompass

One-stop platform for benchmarking large language models across 100+ datasets

github.com/open-compass/opencompass · ★ 7.1k · opencompass.org.cn

Overview

OpenCompass is an open-source platform for evaluating large language models. It runs a model against more than 100 datasets that cover reasoning, knowledge, coding, math, and other tasks, then reports scores you can compare side by side. The project also publishes public leaderboards (CompassRank) and a dataset hub (CompassHub).

It is built for researchers and engineers who need repeatable, comparable benchmark results rather than ad-hoc spot checks. You point it at a model from Hugging Face or an API, pick the datasets you care about, and it handles inference, answer extraction, and scoring through configuration files or simple command-line flags.
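
The answer-extraction and scoring steps mentioned above can be pictured with a small stand-alone sketch. This is not OpenCompass code: the helper names `extract_last_number` and `accuracy`, the "take the last number" rule, and the sample outputs are all assumptions made for illustration, loosely modeled on a GSM8K-style math task.

```python
import re


def extract_last_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    # Drop commas so "1,250" parses as 1250 (a simplification: "3,4" becomes 34).
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(predictions, references):
    """Percentage of predictions whose extracted number equals the reference."""
    correct = sum(
        1
        for pred, ref in zip(predictions, references)
        if extract_last_number(pred) == float(ref)
    )
    return 100.0 * correct / len(references)


# Hypothetical model outputs and gold answers, for illustration only.
outputs = [
    "She has 3 + 4 = 7 apples. The answer is 7",
    "The total cost is 1,250 dollars",
    "I am not sure.",
]
gold = ["7", "1250", "12"]

score = accuracy(outputs, gold)
print(f"accuracy: {score:.1f}")
```

A real harness uses per-dataset extraction rules rather than one regular expression; the point here is only that free-form text has to be reduced to a comparable answer before it can be scored.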

As a benchmark harness, OpenCompass sits in the evaluation stage of a model workflow. It supports multiple inference backends such as Hugging Face Transformers, vLLM, and LMDeploy, so you can evaluate the same model on different runtimes and reproduce official leaderboard results locally.

What it does

Runs models against 100+ datasets covering reasoning, knowledge, coding, math, and long-context tasks
Works with Hugging Face models and API models, plus vLLM and LMDeploy inference backends
Configuration-driven: define models, datasets, and summarizers in Python config files or pass them as CLI flags
Public CompassRank leaderboards and the CompassHub dataset hub for comparing results
LLM-as-judge and math-verification evaluators (GenericLLMEvaluator, MATHVerifyEvaluator) for open-ended and reasoning tasks
Both perplexity (ppl) and generative (gen) evaluation modes per dataset
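
To make the difference between the two modes concrete: in gen mode the model writes an answer that is then parsed, while in ppl mode each candidate answer is scored by how unsurprising the model finds it. The sketch below shows only the general ppl idea with made-up log-probabilities. The function names and numbers are hypothetical and are not taken from OpenCompass.

```python
import math


def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_by_perplexity(option_logprobs):
    """Return the option label whose text has the lowest perplexity."""
    return min(option_logprobs, key=lambda label: perplexity(option_logprobs[label]))


# Hypothetical per-token log-probabilities for each candidate answer.
option_logprobs = {
    "A": [-2.0, -1.5, -2.5],
    "B": [-0.5, -0.4, -0.6],
    "C": [-1.0, -3.0],
}

choice = pick_by_perplexity(option_logprobs)
print(choice)
```

Because the mean is taken per token, candidates of different lengths stay comparable, which is why perplexity rather than the raw summed log-probability is used in this sketch.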

Getting started

Install OpenCompass with pip, then run an evaluation by choosing a model and one or more datasets.

Install

Install the latest release from PyPI. Python 3.10 is recommended.

bash

pip install -U opencompass

Run an evaluation with the CLI

Pass a model config and one or more datasets. This example evaluates a Hugging Face model on a demo GSM8K dataset.

bash

opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
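
If you script evaluations from Python, one option is to assemble the same command as an argument list and hand it to `subprocess`. The helper below is a hypothetical convenience, not part of OpenCompass. It uses only the `--models` and `--datasets` flags shown above, and it assumes that `--models` accepts several names in the same way `--datasets` does.

```python
import shlex


def build_command(models, datasets):
    """Build an argument list for the opencompass CLI call shown above."""
    if not models or not datasets:
        raise ValueError("need at least one model and one dataset")
    return ["opencompass", "--models", *models, "--datasets", *datasets]


cmd = build_command(["hf_internlm2_5_1_8b_chat"], ["demo_gsm8k_chat_gen"])

# Print the command for inspection instead of running it.
print(shlex.join(cmd))

# To launch the run, pass the list to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell-quoting problems when model or dataset names come from user input.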

Evaluate any Hugging Face model by path

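Use run.py with --hf-path to point at a model on the Hugging Face Hub, --hf-type to declare the model type (base in this example), and --datasets to select datasets directly. run.py is the launcher script in the repository, so run this from a source checkout.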

bash

python run.py --datasets siqa_gen winograd_ppl \
  --hf-type base \
  --hf-path facebook/opt-125m

Use a config file

For repeatable setups, define models, datasets, and summarizers in a Python config and pass it to run.py.

bash

python run.py configs/eval_demo.py

Commands and code are distilled from the project's own documentation; always check the official repository for the latest versions.

When to use it

Benchmark a fine-tuned or newly released model against standard datasets before publishing results
Reproduce official leaderboard scores (e.g. CompassAcademic) locally to validate a model
Compare several candidate models on the same task suite to pick one for production
Evaluate reasoning and math performance using LLM-judge or math-verification evaluators
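
For the model-comparison use case, the last step usually comes down to lining scores up per dataset. The sketch below assumes you have already collected scores into a nested dictionary. The `scores` layout, the function names, and the numbers are placeholders for illustration, not OpenCompass output or measured results.

```python
def format_comparison(scores):
    """Render {model: {dataset: score}} as a fixed-width text table."""
    models = list(scores)
    datasets = sorted({d for per_model in scores.values() for d in per_model})
    width = max(len(name) for name in models + datasets + ["dataset"]) + 2
    lines = ["dataset".ljust(width) + "".join(m.rjust(width) for m in models)]
    for d in datasets:
        cells = []
        for m in models:
            value = scores[m].get(d)
            # A dash marks a dataset that a model was not evaluated on.
            cells.append(("-" if value is None else f"{value:.1f}").rjust(width))
        lines.append(d.ljust(width) + "".join(cells))
    return "\n".join(lines)


def best_model(scores):
    """Return the model with the highest mean score over the datasets it ran."""
    return max(scores, key=lambda m: sum(scores[m].values()) / len(scores[m]))


# Placeholder numbers for illustration only, not measured results.
scores = {
    "model_a": {"gsm8k": 40.0, "mmlu": 50.0},
    "model_b": {"gsm8k": 55.0, "mmlu": 45.0},
}

print(format_comparison(scores))
print("best on average:", best_model(scores))
```

A plain mean is only fair when every model ran the same datasets; if coverage differs, compare on the shared subset instead.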

How OpenCompass compares

OpenCompass alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LM Evaluation Harness	★ 13k	EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass	★ 7.1k	One-stop platform for benchmarking large language models across 100+ datasets
SWE-bench	★ 5.2k	A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals	★ 4.5k	OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval	★ 4.2k	An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench	★ 3.5k	A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM	★ 2.8k	Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
LightEval	★ 2.5k	Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.
