Overview
OpenAI Evals is a framework for evaluating large language models (LLMs) and systems built on top of them. It ships with an open registry of existing evals that test different dimensions of OpenAI models, and it lets you write your own custom evals for the use cases you care about.
It is aimed at developers building with LLMs who need a repeatable way to measure how different model versions affect their application. Many basic and model-graded evals require no evaluation code at all: you provide your data in JSON and set parameters in YAML. You can also build private evals against your own data without exposing it publicly.
Within the evaluation and testing space, Evals acts as a standard harness for running and comparing model behaviour. It supports more advanced setups, such as prompt chains and tool-using agents, through its Completion Function Protocol.
What it does
- Open registry of ready-made evals covering many dimensions of model behaviour
- Write custom evals, including model-graded evals defined in YAML
- Build private evals against your own data without publishing it
- Eval templates that need no evaluation code, just JSON data plus YAML parameters
- Completion Function Protocol for prompt chains and tool-using agents
- Optional logging of results to a Snowflake database or Weights & Biases
Getting started
You need Python 3.9 or newer and an OpenAI API key. Set the key once, then install the package depending on whether you want to run existing evals or create your own.
Set your OpenAI API key
Evals call the OpenAI API, so export your key as an environment variable before running anything. Be aware of the API usage costs.
export OPENAI_API_KEY=your-api-keyInstall to run existing evals
If you only want to run evals locally rather than contribute new ones, install the package from PyPI.
pip install evalsInstall from source to create evals
If you plan to write evals, clone the repo and install in editable mode so your changes take effect without reinstalling.
pip install -e .Fetch registry data
The evals registry is stored with Git-LFS. After installing LFS, pull the data files from within your local copy of the repo.
cd evals
git lfs fetch --all
git lfs pullCommands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Measure how a new model version changes the behaviour of your LLM application before you ship it
- Run existing registry evals to compare models across different tasks
- Build a private eval on your own production data without exposing it publicly
- Evaluate prompt chains or tool-using agents with the Completion Function Protocol
How OpenAI Evals compares
OpenAI Evals alongside other open-source evaluation & red-teaming tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Strix | ★ 26.1k | Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts. |
| promptfoo | ★ 22.4k | A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities. |
| OpenAI Evals | ★ 18.7k | A framework and open registry for evaluating LLMs and LLM-based systems |
| DeepEval | ★ 16.3k | An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD. |
| Ragas | ★ 14.4k | An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels. |
| Arize Phoenix | ★ 10.2k | An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production. |
| garak | ★ 8.2k | An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses. |
| Giskard | ★ 5.4k | An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner. |