Overview
simple-evals is a small Python library from OpenAI for evaluating language models on standard benchmarks. It was open sourced so the accuracy numbers OpenAI publishes alongside its models can be reproduced and inspected. The emphasis is on a zero-shot, chain-of-thought setup, which the maintainers consider closer to how models are actually used than many-shot prompting.
It targets people who need to measure model accuracy on known academic tests, such as researchers comparing models or engineers checking a model against a published score. Out of the box it covers benchmarks like MMLU, MATH, GPQA, HumanEval, MGSM, DROP, and SimpleQA, and it ships adapters for the OpenAI and Anthropic APIs.
As a benchmark harness, it sits in the evaluation and testing category: you point it at a model name, pick a benchmark and a number of examples, and it produces accuracy numbers. Note that the repo is in maintenance mode and, as of July 2025, is no longer updated for new models or results, though it continues to host reference implementations for HealthBench, BrowseComp, and SimpleQA.
What it does
- Reference implementations of common benchmarks: MMLU, MATH, GPQA, HumanEval, MGSM, DROP, and SimpleQA
- Zero-shot, chain-of-thought prompting setup used for OpenAI's published accuracy numbers
- Built-in adapters for the OpenAI and Anthropic APIs via the `openai` and `anthropic` packages
- Simple CLI: list available models, then run an eval by model name and example count
- Configurable number of examples per run, useful for quick smoke tests or full runs
- Published benchmark tables you can compare your own runs against
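To make the zero-shot, chain-of-thought setup listed above concrete, here is a minimal sketch of how such a multiple-choice eval can be scored. The template text, the regular expression, and the function names are assumptions made for illustration; they are not copied from the simple-evals source, so check the repo for the prompts it actually uses.

```python
import re
from typing import Dict, List, Optional

# Hypothetical prompt template: no worked examples (zero-shot), with a request
# to reason step by step (chain-of-thought) before a machine-readable answer line.
ZERO_SHOT_COT_TEMPLATE = (
    "Answer the following multiple choice question. Think step by step, "
    "then finish with a line of the form 'Answer: X' where X is one of ABCD.\n\n"
    "{question}\n\n"
    "A) {A}\nB) {B}\nC) {C}\nD) {D}"
)

# Hypothetical pattern for pulling the final letter out of a free-form response.
ANSWER_PATTERN = re.compile(r"Answer:\s*([ABCD])")


def build_prompt(question: str, choices: Dict[str, str]) -> str:
    """Format one question; choices must have the keys A, B, C, and D."""
    return ZERO_SHOT_COT_TEMPLATE.format(question=question, **choices)


def extract_answer(response: str) -> Optional[str]:
    """Return the letter from the last 'Answer: X' match, or None if there is none."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: List[str], targets: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the target letter."""
    correct = sum(extract_answer(r) == t for r, t in zip(responses, targets))
    return correct / len(targets)
```

Taking the last match rather than the first is a deliberate choice in this sketch: a chain-of-thought response may mention candidate answers while reasoning, and only the final line is meant to be scored.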
Getting started
simple-evals has no unified installer because dependencies are optional; install only what you need, set your API key, then run an eval from the command line.
Install the API client you need
Install the OpenAI client, the Anthropic client, or both, depending on which models you want to evaluate.
pip install openai
pip install anthropic
Add HumanEval support (optional)
The HumanEval benchmark needs the separate human-eval package.
git clone https://github.com/openai/human-eval
pip install -e human-eval
Set your API key
Export the relevant *_API_KEY environment variable before running, for example OPENAI_API_KEY or ANTHROPIC_API_KEY.
export OPENAI_API_KEY=sk-...
List models and run an eval
List the available model names, then run a benchmark by passing a model and the number of examples.
python -m simple-evals.simple_evals --list-models
python -m simple-evals.simple_evals --model <model_name> --examples <num_examples>
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
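If you want to script the run step, the commands above can be wrapped in a few lines of Python. This is a sketch under stated assumptions: it uses only the `--model` and `--examples` flags shown above, it must be launched from the same working directory as those commands, and the helper names `build_eval_command` and `run_eval` are hypothetical, not part of simple-evals.

```python
import os
import subprocess
import sys
from typing import List

# Module path exactly as it appears in the commands above.
MODULE = "simple-evals.simple_evals"


def build_eval_command(model: str, examples: int) -> List[str]:
    """Build the argument list for one eval run."""
    if examples < 1:
        raise ValueError("examples must be a positive integer")
    return [sys.executable, "-m", MODULE, "--model", model, "--examples", str(examples)]


def run_eval(model: str, examples: int, key_var: str = "OPENAI_API_KEY") -> int:
    """Check that the API key variable is set, run the eval, return its exit code."""
    if not os.environ.get(key_var):
        raise RuntimeError(f"{key_var} is not set; export it before running")
    return subprocess.run(build_eval_command(model, examples)).returncode
```

Failing fast on a missing key avoids starting a run that would only error out at the first API call. For Anthropic models, pass `key_var="ANTHROPIC_API_KEY"`.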
When to use it
- Reproduce or sanity-check OpenAI's published accuracy numbers for a given model
- Compare two models on benchmarks like MMLU, MATH, or GPQA under the same zero-shot setup
- Run a quick smoke test on a small number of examples before a full benchmark run
- Use the reference implementations of SimpleQA, HealthBench, or BrowseComp as a starting point for your own evals
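A smoke test on few examples trades precision for speed, which matters when comparing two models. The sketch below estimates that uncertainty with a simple binomial model, assuming each example is scored independently as right or wrong. That assumption belongs to this sketch; it is not a description of how simple-evals computes its own statistics, and the function names are hypothetical.

```python
import math
from typing import Tuple


def accuracy_standard_error(accuracy: float, num_examples: int) -> float:
    """Binomial standard error of an accuracy measured on num_examples items."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if num_examples < 1:
        raise ValueError("num_examples must be a positive integer")
    return math.sqrt(accuracy * (1.0 - accuracy) / num_examples)


def approx_95_interval(accuracy: float, num_examples: int) -> Tuple[float, float]:
    """Normal-approximation 95% interval, clipped to the range [0, 1]."""
    half_width = 1.96 * accuracy_standard_error(accuracy, num_examples)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)
```

At a measured accuracy of 0.5, 100 examples give a standard error of 0.05, while 10,000 examples give 0.005. Differences of a few points between models are therefore hard to read from a small run.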
How simple-evals compares
simple-evals alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight harness for running standard zero-shot, chain-of-thought LLM benchmarks. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |