Overview
HELM (Holistic Evaluation of Language Models) is an open-source Python framework from the Center for Research on Foundation Models (CRFM) at Stanford. It runs standardized benchmarks against foundation models, including large language models and multimodal models, so results are reproducible and transparent.
It is aimed at researchers and engineers who need to compare models on more than a single accuracy number. HELM ships datasets and benchmarks such as MMLU-Pro, GPQA, IFEval, and WildBench in a common format, reaches many model providers through one interface, and measures aspects like efficiency, bias, and toxicity alongside accuracy.
As a benchmark harness, HELM also includes a web UI for inspecting individual prompts and responses and public leaderboards for comparing models across benchmarks. Note that the project entered maintenance mode on June 1, 2026.
What it does
- Standardized datasets and benchmarks including MMLU-Pro, GPQA, IFEval, and WildBench
- Unified interface to models from many providers, such as OpenAI, Anthropic Claude, and Google Gemini
- Metrics that go beyond accuracy to cover efficiency, bias, and toxicity
- Built-in web UI (helm-server) for inspecting individual prompts and responses
- Public leaderboards for comparing results across models and benchmarks, including HELM Capabilities, HELM Safety, and VHELM
- Reproduces published evaluation results from HELM papers across domains like medicine and finance
Getting started
Install HELM from PyPI, then run a benchmark, summarize the results, and view them in the local web server.
Install the package
Install HELM from PyPI with pip.
pip install crfm-helm
Run a benchmark
Run a benchmark entry against a model, choosing a suite name and capping the number of evaluation instances.
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
Summarize the results
Aggregate the run output into summary results for your suite.
helm-summarize --suite my-suite
View results in the browser
Start the web server, then open http://localhost:8000/ in your browser.
helm-server --suite my-suite
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
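The run, summarize, and serve steps above can also be chained from a small Python script. The following is a minimal sketch, not part of HELM itself: the helper names `build_helm_commands` and `run_pipeline` are hypothetical, and only the command names and flags shown in the steps above are used. By default it performs a dry run that prints the commands, so it works even where HELM is not installed.

```python
import shlex
import subprocess


def build_helm_commands(run_entry, suite, max_eval_instances):
    """Return the three CLI invocations from the steps above as argument lists.

    Hypothetical helper for illustration; it only assembles the flags
    shown in the getting-started commands.
    """
    return [
        ["helm-run", "--run-entries", run_entry, "--suite", suite,
         "--max-eval-instances", str(max_eval_instances)],
        ["helm-summarize", "--suite", suite],
        # Assumption: as a web server, helm-server keeps running until
        # it is interrupted, so it goes last.
        ["helm-server", "--suite", suite],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


commands = build_helm_commands(
    run_entry="mmlu:subject=philosophy,model=openai/gpt2",
    suite="my-suite",
    max_eval_instances=10,
)
run_pipeline(commands)  # dry run: prints the commands without executing them
```

Calling `run_pipeline(commands, dry_run=False)` would execute the steps in order, which assumes `crfm-helm` is installed and its entry points are on the `PATH`. Passing argument lists rather than a single shell string avoids quoting problems if a run entry ever contains characters the shell treats specially.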
When to use it
- Benchmark a new model against standard datasets like MMLU-Pro or GPQA in a reproducible way
- Compare models from different providers through one interface without writing provider-specific glue
- Measure non-accuracy aspects such as efficiency, bias, and toxicity for a model release
- Reproduce published leaderboard results from HELM papers for research or auditing
How HELM compares
HELM alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's framework for holistic, reproducible benchmarking of language and multimodal models. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |