AI/TLDR

HELM

Holistic, reproducible benchmarking for language and multimodal models

Overview

HELM (Holistic Evaluation of Language Models) is an open-source Python framework from the Center for Research on Foundation Models (CRFM) at Stanford. It runs standardized benchmarks against foundation models, including large language models and multimodal models, so results are reproducible and transparent.

It is aimed at researchers and engineers who need to compare models on more than a single accuracy number. HELM ships datasets and benchmarks such as MMLU-Pro, GPQA, IFEval, and WildBench in a common format, reaches many model providers through one interface, and measures aspects like efficiency, bias, and toxicity alongside accuracy.

As a benchmark harness, HELM also includes a web UI for inspecting individual prompts and responses and public leaderboards for comparing models across benchmarks. Note that the project entered maintenance mode on June 1, 2026.

What it does

  • Standardized datasets and benchmarks including MMLU-Pro, GPQA, IFEval, and WildBench
  • Unified interface to models from many providers, such as OpenAI, Anthropic Claude, and Google Gemini
  • Metrics that go beyond accuracy to cover efficiency, bias, and toxicity
  • Built-in web UI (helm-server) for inspecting individual prompts and responses
  • Public leaderboards for comparing results across models and benchmarks, including HELM Capabilities, HELM Safety, and VHELM
  • Reproduces published evaluation results from HELM papers across domains like medicine and finance

Getting started

Install HELM from PyPI, then run a benchmark, summarize the results, and view them in the local web server.

Install the package

Install HELM from PyPI with pip.

bash
pip install crfm-helm
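
After installing, it can help to confirm that the command-line tools used in the following steps are reachable. The snippet below is a minimal sketch, not part of HELM itself: the helper name `missing_tools` is hypothetical, and it only checks that `helm-run`, `helm-summarize`, and `helm-server` resolve on your PATH.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Command names taken from the steps in this guide.
HELM_TOOLS = ("helm-run", "helm-summarize", "helm-server")


def missing_tools(
    tools: Iterable[str] = HELM_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All HELM command-line tools were found.")
```

If any tool is reported missing, the usual cause is that the package was installed into a different Python environment than the one currently active.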

Run a benchmark

Run a benchmark entry against a model, choosing a suite name and capping the number of evaluation instances.

bash
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
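
The run entry in the command above appears to follow a `scenario:key=value,key=value` pattern. The sketch below assembles such a string and the matching `helm-run` argument list in Python. The pattern is inferred from this single example, and the helper names `build_run_entry` and `build_helm_run_command` are hypothetical, so check the official documentation before relying on it for other benchmarks.

```python
import shlex
from typing import List


def build_run_entry(scenario: str, **params: str) -> str:
    """Join a scenario name and key=value pairs into one run-entry string.

    Assumption: the format is inferred from the example in this guide.
    """
    if not params:
        # Assumption: an entry without parameters is just the scenario name.
        return scenario
    pairs = ",".join(f"{key}={value}" for key, value in params.items())
    return f"{scenario}:{pairs}"


def build_helm_run_command(run_entry: str, suite: str, max_eval_instances: int) -> List[str]:
    """Build the helm-run argument list using only the flags shown in this guide."""
    return [
        "helm-run",
        "--run-entries", run_entry,
        "--suite", suite,
        "--max-eval-instances", str(max_eval_instances),
    ]


entry = build_run_entry("mmlu", subject="philosophy", model="openai/gpt2")
command = build_helm_run_command(entry, suite="my-suite", max_eval_instances=10)

# Print the command as a single shell-safe line; nothing is executed here.
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, which avoids quoting problems if it is later passed to `subprocess.run`.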

Summarize the results

Aggregate the run output into summary results for your suite.

bash
helm-summarize --suite my-suite

View results in the browser

Start the web server, then open http://localhost:8000/ in your browser.

bash
helm-server --suite my-suite
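
The run, summarize, and view steps above share one suite name and must happen in that order. As a rough sketch, they could be scripted from Python as shown here. The function names `pipeline_commands` and `run_pipeline` are hypothetical, only the flags from this guide are used, and the script defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess
from typing import List


def pipeline_commands(run_entry: str, suite: str, max_eval_instances: int) -> List[List[str]]:
    """Return the three commands in the order this guide runs them."""
    return [
        ["helm-run", "--run-entries", run_entry, "--suite", suite,
         "--max-eval-instances", str(max_eval_instances)],
        ["helm-summarize", "--suite", suite],
        ["helm-server", "--suite", suite],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print("$", shlex.join(command))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits with an error,
            # so later steps do not run on incomplete results.
            subprocess.run(command, check=True)


commands = pipeline_commands(
    "mmlu:subject=philosophy,model=openai/gpt2", suite="my-suite", max_eval_instances=10
)
run_pipeline(commands)  # dry run: prints the commands without executing them
```

With `dry_run=False`, the final step starts the web server, which is expected to keep running until you interrupt it.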

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Benchmark a new model against standard datasets like MMLU-Pro or GPQA in a reproducible way
  • Compare models from different providers through one interface without writing provider-specific glue
  • Measure non-accuracy aspects such as efficiency, bias, and toxicity for a model release
  • Reproduce published leaderboard results from HELM papers for research or auditing

How HELM compares

HELM alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM | ★ 2.8k | Holistic, reproducible benchmarking for language and multimodal models
LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.