simple-evals

OpenAI's lightweight harness for running standard zero-shot, chain-of-thought LLM benchmarks

Overview

simple-evals is a small Python library from OpenAI for evaluating language models on standard benchmarks. It was open sourced so the accuracy numbers OpenAI publishes alongside its models can be reproduced and inspected. The emphasis is on a zero-shot, chain-of-thought setup, which the maintainers consider closer to how models are actually used than many-shot prompting.

It targets people who need to measure model accuracy on known academic tests, such as researchers comparing models or engineers checking a model against a published score. Out of the box it covers benchmarks like MMLU, MATH, GPQA, HumanEval, MGSM, DROP, and SimpleQA, and it ships adapters for the OpenAI and Anthropic APIs.

As a benchmark harness, it sits in the evaluation and testing category: you point it at a model name, pick a benchmark and a number of examples, and it produces accuracy numbers. Note that the repo is in maintenance mode and, as of July 2025, is no longer updated for new models or results, though it continues to host reference implementations for HealthBench, BrowseComp, and SimpleQA.

What it does

Reference implementations of common benchmarks: MMLU, MATH, GPQA, HumanEval, MGSM, DROP, and SimpleQA
Zero-shot, chain-of-thought prompting setup used for OpenAI's published accuracy numbers
Built-in adapters for the OpenAI and Anthropic APIs via the `openai` and `anthropic` packages
Simple CLI: list available models, then run an eval by model name and example count
Configurable number of examples per run, useful for quick smoke tests or full runs
Published benchmark tables you can compare your own runs against

Getting started

simple-evals has no unified installer because dependencies are optional; install only what you need, set your API key, then run an eval from the command line.

Install the API client you need

Install the OpenAI client, the Anthropic client, or both, depending on which models you want to evaluate.

bashbash

pip install openai
pip install anthropic

Add HumanEval support (optional)

The HumanEval benchmark needs the separate human-eval package.

bashbash

git clone https://github.com/openai/human-eval
pip install -e human-eval

Set your API key

Export the relevant *_API_KEY environment variable before running, for example OPENAI_API_KEY or ANTHROPIC_API_KEY.

bashbash

export OPENAI_API_KEY=sk-...

List models and run an eval

List the available model names, then run a benchmark by passing a model and the number of examples.

bashbash

python -m simple-evals.simple_evals --list-models
python -m simple-evals.simple_evals --model <model_name> --examples <num_examples>

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Reproduce or sanity-check OpenAI's published accuracy numbers for a given model
Compare two models on benchmarks like MMLU, MATH, or GPQA under the same zero-shot setup
Run a quick smoke test on a small number of examples before a full benchmark run
Use the reference implementations of SimpleQA, HealthBench, or BrowseComp as a starting point for your own evals

How simple-evals compares

simple-evals alongside other open-source benchmark harnesses tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LM Evaluation Harness	★ 13k	EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass	★ 7.1k	An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench	★ 5.2k	A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals	★ 4.5k	OpenAI's lightweight harness for running standard zero-shot, chain-of-thought LLM benchmarks
lmms-eval	★ 4.2k	An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench	★ 3.5k	A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM	★ 2.8k	Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
LightEval	★ 2.5k	Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.

// Overview

// What it does

// Getting started

Install the API client you need

Add HumanEval support (optional)

Set your API key

List models and run an eval

// When to use it

// How simple-evals compares

Overview

What it does

Getting started

When to use it

How simple-evals compares