OpenAI Evals

A framework and open registry for evaluating LLMs and LLM-based systems

Overview

OpenAI Evals is a framework for evaluating large language models (LLMs) and systems built on top of them. It ships with an open registry of existing evals that test different dimensions of OpenAI models, and it lets you write your own custom evals for the use cases you care about.

It is aimed at developers building with LLMs who need a repeatable way to measure how different model versions affect their application. Many basic and model-graded evals require no evaluation code at all: you provide your data in JSON and set parameters in YAML. You can also build private evals against your own data without exposing it publicly.

Within the evaluation and testing space, Evals acts as a standard harness for running and comparing model behaviour. It supports more advanced setups, such as prompt chains and tool-using agents, through its Completion Function Protocol.

What it does

Open registry of ready-made evals covering many dimensions of model behaviour
Write custom evals, including model-graded evals defined in YAML
Build private evals against your own data without publishing it
Eval templates that need no evaluation code, just JSON data plus YAML parameters
Completion Function Protocol for prompt chains and tool-using agents
Optional logging of results to a Snowflake database or Weights & Biases

Getting started

You need Python 3.9 or newer and an OpenAI API key. Set the key once, then install the package depending on whether you want to run existing evals or create your own.

Set your OpenAI API key

Evals call the OpenAI API, so export your key as an environment variable before running anything. Be aware of the API usage costs.

bashbash

export OPENAI_API_KEY=your-api-key

Install to run existing evals

If you only want to run evals locally rather than contribute new ones, install the package from PyPI.

bashbash

pip install evals

Install from source to create evals

If you plan to write evals, clone the repo and install in editable mode so your changes take effect without reinstalling.

bashbash

pip install -e .

Fetch registry data

The evals registry is stored with Git-LFS. After installing LFS, pull the data files from within your local copy of the repo.

bashbash

cd evals
git lfs fetch --all
git lfs pull

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Measure how a new model version changes the behaviour of your LLM application before you ship it
Run existing registry evals to compare models across different tasks
Build a private eval on your own production data without exposing it publicly
Evaluate prompt chains or tool-using agents with the Completion Function Protocol

How OpenAI Evals compares

OpenAI Evals alongside other open-source evaluation & red-teaming tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Strix	★ 26.1k	Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts.
promptfoo	★ 22.4k	A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities.
OpenAI Evals	★ 18.7k	A framework and open registry for evaluating LLMs and LLM-based systems
DeepEval	★ 16.3k	An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD.
Ragas	★ 14.4k	An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels.
Arize Phoenix	★ 10.2k	An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production.
garak	★ 8.2k	An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses.
Giskard	★ 5.4k	An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner.

// Overview

// What it does

// Getting started

Set your OpenAI API key

Install to run existing evals

Install from source to create evals

Fetch registry data

// When to use it

// How OpenAI Evals compares

Overview

What it does

Getting started

When to use it

How OpenAI Evals compares