AI/TLDR

simple-evals

OpenAI's lightweight harness for running standard zero-shot, chain-of-thought LLM benchmarks

Overview

simple-evals is a small Python library from OpenAI for evaluating language models on standard benchmarks. It was open sourced so the accuracy numbers OpenAI publishes alongside its models can be reproduced and inspected. The emphasis is on a zero-shot, chain-of-thought setup, which the maintainers consider closer to how models are actually used than many-shot prompting.

It targets people who need to measure model accuracy on known academic tests, such as researchers comparing models or engineers checking a model against a published score. Out of the box it covers benchmarks like MMLU, MATH, GPQA, HumanEval, MGSM, DROP, and SimpleQA, and it ships adapters for the OpenAI and Anthropic APIs.

As a benchmark harness, it sits in the evaluation and testing category: you point it at a model name, pick a benchmark and a number of examples, and it produces accuracy numbers. Note that the repo is in maintenance mode and, as of July 2025, is no longer updated for new models or results, though it continues to host reference implementations for HealthBench, BrowseComp, and SimpleQA.

What it does

  • Reference implementations of common benchmarks: MMLU, MATH, GPQA, HumanEval, MGSM, DROP, and SimpleQA
  • Zero-shot, chain-of-thought prompting setup used for OpenAI's published accuracy numbers
  • Built-in adapters for the OpenAI and Anthropic APIs via the `openai` and `anthropic` packages
  • Simple CLI: list available models, then run an eval by model name and example count
  • Configurable number of examples per run, useful for quick smoke tests or full runs
  • Published benchmark tables you can compare your own runs against
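
To make the zero-shot, chain-of-thought setup more concrete, here is a minimal sketch of what such a prompt and its answer parsing might look like for a multiple-choice question. The template wording, the regex, and the `format_question`, `extract_answer`, and `score` helpers are illustrative assumptions, not code taken from simple-evals.

```python
import re
from typing import Dict, Optional

# Hypothetical template: no worked examples are included (zero-shot), and the
# model is asked to reason before committing to a final answer line
# (chain-of-thought).
PROMPT_TEMPLATE = """Answer the following multiple choice question. Think step by step, then finish with a line of the form "Answer: X" where X is one of A, B, C, or D.

{question}

A) {A}
B) {B}
C) {C}
D) {D}"""

# Matches a final-answer line such as "Answer: B".
ANSWER_PATTERN = re.compile(r"Answer\s*:\s*([A-D])")


def format_question(question: str, choices: Dict[str, str]) -> str:
    """Fill the template with one question and its four labelled choices."""
    return PROMPT_TEMPLATE.format(question=question, **choices)


def extract_answer(response: str) -> Optional[str]:
    """Return the letter from the last 'Answer: X' in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def score(response: str, correct_letter: str) -> float:
    """Return 1.0 if the extracted letter matches the reference, else 0.0."""
    return 1.0 if extract_answer(response) == correct_letter else 0.0
```

Taking the last match rather than the first lets the model change its mind during its reasoning without being penalized for an earlier guess.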

Getting started

simple-evals has no single install step because its dependencies are optional. Install only the clients you need, set your API key, then run an eval from the command line.

Install the API client you need

Install the OpenAI client, the Anthropic client, or both, depending on which models you want to evaluate.

```bash
pip install openai
pip install anthropic
```

Add HumanEval support (optional)

The HumanEval benchmark needs the separate human-eval package.

```bash
git clone https://github.com/openai/human-eval
pip install -e human-eval
```

Set your API key

Export the relevant *_API_KEY environment variable before running, for example OPENAI_API_KEY or ANTHROPIC_API_KEY.

```bash
export OPENAI_API_KEY=sk-...
```
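
A run that starts without a key will usually fail only once the first API request is made, so it can help to check up front. The sketch below is one way to do that; the `missing_keys` helper is hypothetical and not part of simple-evals, and it only checks that the variable is set, not that the key is valid.

```python
import os
from typing import List, Mapping


def missing_keys(required: List[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]
```

For example, `missing_keys(["OPENAI_API_KEY"])` returns an empty list when the key is exported in the current shell, and `["OPENAI_API_KEY"]` otherwise.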

List models and run an eval

List the available model names, then run a benchmark by passing a model and the number of examples.

```bash
python -m simple-evals.simple_evals --list-models
python -m simple-evals.simple_evals --model <model_name> --examples <num_examples>
```
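
If you want to script several runs, the same commands can be assembled from Python. This is a rough sketch that uses only the module path and flags shown above; the `build_command` and `run_eval` helpers are hypothetical. It assumes the process is started from a directory where `simple-evals` is importable, which is presumably the directory that contains the checkout.

```python
import subprocess
import sys
from typing import List, Optional

# Module path and flags are copied from the commands above.
MODULE = "simple-evals.simple_evals"


def build_command(model: Optional[str] = None, examples: Optional[int] = None) -> List[str]:
    """Build the argument list for one run; with no model, list models instead."""
    cmd = [sys.executable, "-m", MODULE]
    if model is None:
        return cmd + ["--list-models"]
    cmd += ["--model", model]
    if examples is not None:
        if examples <= 0:
            raise ValueError("examples must be a positive integer")
        cmd += ["--examples", str(examples)]
    return cmd


def run_eval(model: str, examples: int) -> int:
    """Run one eval as a subprocess and return its exit code."""
    return subprocess.run(build_command(model, examples)).returncode
```

Passing the arguments as a list rather than a shell string avoids quoting problems if a model name contains unusual characters.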

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Reproduce or sanity-check OpenAI's published accuracy numbers for a given model
  • Compare two models on benchmarks like MMLU, MATH, or GPQA under the same zero-shot setup
  • Run a quick smoke test on a small number of examples before a full benchmark run
  • Use the reference implementations of SimpleQA, HealthBench, or BrowseComp as a starting point for your own evals
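
When comparing a small smoke-test run against a published number, keep in mind how noisy a small sample is. The sketch below estimates an approximate 95% interval around a measured accuracy using the normal approximation to the binomial. The `accuracy_with_interval` helper is hypothetical, not something simple-evals provides, and the approximation is crude for very small samples or accuracies near 0 or 1.

```python
import math
from typing import Tuple


def accuracy_with_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (accuracy, low, high) using the normal approximation."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    acc = correct / total
    # Standard error of a proportion estimated from `total` independent trials.
    stderr = math.sqrt(acc * (1 - acc) / total)
    # Clamp the interval to the valid accuracy range [0, 1].
    return acc, max(0.0, acc - z * stderr), min(1.0, acc + z * stderr)
```

With 8 of 10 examples correct, the interval spans roughly 0.55 to 1.0. With 800 of 1000 correct, it narrows to about 0.775 to 0.825, so a gap of a few points from a published score is only meaningful on the larger run.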

How simple-evals compares

simple-evals alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight harness for running standard zero-shot, chain-of-thought LLM benchmarks. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |