HELM

Holistic, reproducible benchmarking for language and multimodal models

Repository: github.com/stanford-crfm/helm (★ 2.8k)
Website: crfm.stanford.edu/helm

Overview

HELM (Holistic Evaluation of Language Models) is an open-source Python framework from the Center for Research on Foundation Models (CRFM) at Stanford. It runs standardized benchmarks against foundation models, including large language models and multimodal models, so results are reproducible and transparent.

It is aimed at researchers and engineers who need to compare models on more than a single accuracy number. HELM ships datasets and benchmarks such as MMLU-Pro, GPQA, IFEval, and WildBench in a common format, reaches many model providers through one interface, and measures efficiency, bias, and toxicity alongside accuracy.

As a benchmark harness, HELM also includes a web UI for inspecting individual prompts and responses and public leaderboards for comparing models across benchmarks. Note that the project entered maintenance mode on June 1, 2026.

What it does

Standardized datasets and benchmarks including MMLU-Pro, GPQA, IFEval, and WildBench
Unified interface to models from many providers, such as OpenAI, Anthropic Claude, and Google Gemini
Metrics that go beyond accuracy to cover efficiency, bias, and toxicity
Built-in web UI (helm-server) for inspecting individual prompts and responses
Public leaderboards for comparing results across models and benchmarks, including HELM Capabilities, HELM Safety, and VHELM
Reproduction of published evaluation results from HELM papers across domains like medicine and finance

Getting started

Install HELM from PyPI, then run a benchmark, summarize the results, and view them in the local web server.

Install the package

Install HELM from PyPI with pip.

pip install crfm-helm

Run a benchmark

Run a benchmark entry against a model, choosing a suite name and capping the number of evaluation instances.

helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
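
The run entry in the command above follows a scenario:key=value,key=value pattern. As a minimal sketch, assuming that pattern holds for the entries you want to run, the Python helpers below assemble such a string and the matching helm-run argument list. The names make_run_entry and helm_run_argv are hypothetical and not part of HELM; only the flags shown above are used.

```python
import shlex
from typing import List


def make_run_entry(scenario: str, **args: str) -> str:
    """Build a run entry such as 'mmlu:subject=philosophy,model=openai/gpt2'.

    Hypothetical helper: it only mirrors the string format used in the
    command above and does not validate scenario or model names.
    """
    if not args:
        return scenario
    pairs = ",".join(f"{key}={value}" for key, value in args.items())
    return f"{scenario}:{pairs}"


def helm_run_argv(run_entry: str, suite: str, max_eval_instances: int) -> List[str]:
    """Return the argument list for the helm-run command shown above."""
    return [
        "helm-run",
        "--run-entries", run_entry,
        "--suite", suite,
        "--max-eval-instances", str(max_eval_instances),
    ]


# Rebuild the exact command from this section.
entry = make_run_entry("mmlu", subject="philosophy", model="openai/gpt2")
argv = helm_run_argv(entry, suite="my-suite", max_eval_instances=10)
print(shlex.join(argv))
```

Building the argument list in code makes it easier to loop over several subjects or models while keeping the suite name and instance cap consistent.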

Summarize the results

Aggregate the run output into summary results for your suite.

helm-summarize --suite my-suite

View results in the browser

Start the web server, then open http://localhost:8000/ in your browser.

helm-server --suite my-suite
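
The run and summarize steps above can also be chained from a script. The following is a minimal sketch, assuming the helm-run and helm-summarize executables are on your PATH after installation; the names pipeline_commands and run_pipeline are hypothetical and not part of HELM. Nothing is executed until run_pipeline is called, and helm-server is left out because it keeps running until you stop it.

```python
import subprocess
from typing import Callable, List


def pipeline_commands(run_entry: str, suite: str, max_eval_instances: int) -> List[List[str]]:
    """Return the run and summarize commands in the order they must execute."""
    return [
        [
            "helm-run",
            "--run-entries", run_entry,
            "--suite", suite,
            "--max-eval-instances", str(max_eval_instances),
        ],
        ["helm-summarize", "--suite", suite],
    ]


def run_pipeline(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command in order, stopping at the first failure.

    The runner parameter is an assumption made for this sketch so the
    function can be exercised without HELM installed.
    """
    for command in commands:
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
        runner(command, check=True)


commands = pipeline_commands(
    "mmlu:subject=philosophy,model=openai/gpt2",
    suite="my-suite",
    max_eval_instances=10,
)
for command in commands:
    print(" ".join(command))
```

Calling run_pipeline(commands) would execute both steps; afterwards, start helm-server --suite my-suite separately to browse the results.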

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Benchmark a new model against standard datasets like MMLU-Pro or GPQA in a reproducible way
Compare models from different providers through one interface without writing provider-specific glue
Measure non-accuracy aspects such as efficiency, bias, and toxicity for a model release
Reproduce published leaderboard results from HELM papers for research or auditing

How HELM compares

HELM alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LM Evaluation Harness	★ 13k	EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass	★ 7.1k	An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench	★ 5.2k	A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals	★ 4.5k	OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval	★ 4.2k	An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench	★ 3.5k	A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM	★ 2.8k	Holistic, reproducible benchmarking for language and multimodal models
LightEval	★ 2.5k	Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.
