Overview
OpenCompass is an open-source platform for evaluating large language models. It runs a model against more than 100 datasets that cover reasoning, knowledge, coding, math, and other tasks, then reports scores you can compare side by side. The project also publishes public leaderboards (CompassRank) and a dataset hub (CompassHub).
It is built for researchers and engineers who need repeatable, comparable benchmark results rather than ad-hoc spot checks. You point it at a model from Hugging Face or an API, pick the datasets you care about, and it handles inference, answer extraction, and scoring through configuration files or simple command-line flags.
As a benchmark harness, OpenCompass sits in the evaluation stage of a model workflow. It supports multiple inference backends such as Hugging Face Transformers, vLLM, and LMDeploy, so you can evaluate the same model on different runtimes and reproduce official leaderboard results locally.
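The inference, answer extraction, and scoring stages described above can be pictured with a toy pipeline. This is only an illustrative sketch, not OpenCompass code: `fake_model`, `extract_last_number`, and `accuracy` are hypothetical names, the canned completions stand in for a real backend, and the "last number wins" extraction rule is an assumption chosen for simplicity.

```python
import re
from typing import List, Optional

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an inference backend; returns canned text."""
    canned = {
        "What is 2 + 3?": "Adding the numbers gives 2 + 3 = 5. The answer is 5.",
        "What is 10 - 4?": "10 minus 4 leaves 6 apples, so the answer is 7.",
    }
    return canned[prompt]

def extract_last_number(text: str) -> Optional[str]:
    """Answer extraction: keep the last number found in the completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

def accuracy(predictions: List[Optional[str]], references: List[str]) -> float:
    """Scoring: percentage of exact matches between predictions and references."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Made-up (question, reference answer) pairs, for illustration only.
samples = [("What is 2 + 3?", "5"), ("What is 10 - 4?", "6")]

predictions = [extract_last_number(fake_model(question)) for question, _ in samples]
references = [reference for _, reference in samples]
score = accuracy(predictions, references)

print(predictions, score)
```

The second canned completion is deliberately wrong, so the sketch shows scoring counting a miss as well as a hit.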
What it does
- Runs models against 100+ datasets covering reasoning, knowledge, coding, math, and long-context tasks
- Works with Hugging Face models and API models, plus vLLM and LMDeploy inference backends
- Configuration-driven: define models, datasets, and summarizers in Python config files or pass them as CLI flags
- Public CompassRank leaderboards and the CompassHub dataset hub for comparing results
- LLM-as-judge and math-verification evaluators (GenericLLMEvaluator, MATHVerifyEvaluator) for open-ended and reasoning tasks
- Both perplexity (ppl) and generative (gen) evaluation modes per dataset
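To make the difference between the two modes concrete: in a perplexity-style evaluation the model never writes an answer. Each candidate option is scored by how likely the model finds it, and the option with the lowest perplexity is taken as the prediction. A generative evaluation instead lets the model produce text and extracts the answer from it. The sketch below illustrates only the perplexity side, under stated assumptions: the per-token log-probabilities are invented placeholders, and `perplexity` and `pick_by_ppl` are hypothetical helper names rather than OpenCompass APIs.

```python
import math
from typing import Dict, List

def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity is exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_by_ppl(candidates: Dict[str, List[float]]) -> str:
    """Return the option label whose text the model finds least surprising."""
    return min(candidates, key=lambda label: perplexity(candidates[label]))

# Placeholder log-probs a model might assign to "question + option" per option.
candidate_logprobs = {
    "A": [-2.1, -1.8, -2.5],
    "B": [-0.4, -0.6, -0.5],
    "C": [-1.2, -1.9, -1.4],
}

choice = pick_by_ppl(candidate_logprobs)
print(choice)
```

Perplexity-style scoring needs access to token probabilities, which is one reason it tends to be paired with locally hosted base models rather than text-only APIs.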
Getting started
Install OpenCompass with pip, then run an evaluation by choosing a model and one or more datasets.
Install
Install the latest release from PyPI. Python 3.10 is recommended.
pip install -U opencompass
Run an evaluation with the CLI
Pass a model config and one or more datasets. This example evaluates a Hugging Face model on a demo GSM8K dataset.
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
Evaluate any Hugging Face model by path
From a source checkout of the repository, use run.py with --hf-path to point at a model on the Hub and select datasets directly.
python run.py --datasets siqa_gen winograd_ppl \
    --hf-type base \
    --hf-path facebook/opt-125m
Use a config file
For repeatable setups, define models, datasets, and summarizers in a Python config and pass it to run.py.
python run.py configs/eval_demo.py
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
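When the same evaluation has to be launched from a script, the CLI invocation shown in this section can be assembled as an argument list. The following is a minimal sketch, assuming only the `--models` and `--datasets` flags that appear in the commands above; `build_opencompass_command` is a hypothetical helper, and the sketch prints the command rather than running it.

```python
import shlex
from typing import List

def build_opencompass_command(model: str, datasets: List[str]) -> List[str]:
    """Assemble argv for the `opencompass` CLI from one model and 1+ datasets."""
    if not model:
        raise ValueError("a model config name is required")
    if not datasets:
        raise ValueError("at least one dataset is required")
    return ["opencompass", "--models", model, "--datasets", *datasets]

cmd = build_opencompass_command("hf_internlm2_5_1_8b_chat", ["demo_gsm8k_chat_gen"])

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))

# To launch it for real (requires OpenCompass to be installed), one could use:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` instead of a single string avoids shell quoting problems if a dataset or model name ever contains unusual characters.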
When to use it
- Benchmark a fine-tuned or newly released model against standard datasets before publishing results
- Reproduce official leaderboard scores (e.g. CompassAcademic) locally to validate a model
- Compare several candidate models on the same task suite to pick one for production
- Evaluate reasoning and math performance using LLM-judge or math-verification evaluators
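The reason math tasks benefit from a dedicated verification step is that exact string matching rejects answers that are numerically the same but written differently. The toy check below illustrates that idea only. It is not how MATHVerifyEvaluator is implemented, `numerically_equal` is a hypothetical helper, and it handles plain decimals and simple fractions, nothing symbolic.

```python
from fractions import Fraction

def numerically_equal(prediction: str, reference: str) -> bool:
    """Compare two answers as exact rationals, falling back to string equality."""
    pred, ref = prediction.strip(), reference.strip()
    try:
        # Fraction parses strings such as "3", "0.5", and "1/2" exactly.
        return Fraction(pred) == Fraction(ref)
    except (ValueError, ZeroDivisionError):
        # Not a plain number (e.g. an algebraic expression): compare as text.
        return pred == ref

print(numerically_equal("0.5", "1/2"))   # same value, different notation
print(numerically_equal("0.33", "1/3"))  # a rounded decimal is not the same value
print(numerically_equal("x+1", "1+x"))   # symbolic equivalence is out of scope here
```

The last case shows where a toy check stops: recognizing that `x+1` and `1+x` agree needs symbolic reasoning or a judge model, which is the gap that dedicated evaluators are meant to cover.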
How OpenCompass compares
OpenCompass alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | One-stop platform for benchmarking large language models across 100+ datasets. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |