What Is OpenAI Evals? Open-Source Eval Framework

You will understand what the OpenAI Evals framework does and how the open-source repo relates to the hosted Evals API.

Intermediate · 9 min read · Updated 2026-06-14

Repo: openai/evals (18.7k stars) · Docs: platform.openai.com

In plain English

When you build software, you write tests so you can change the code without breaking it. OpenAI Evals brings that same idea to language models: it is an open-source framework for writing repeatable tests that check whether a model gives the answers you expect. You define a set of example inputs, what a good output looks like, and how to grade the model's response. Then you can run that test on any model, any prompt, any time.


Think of it like a standardized exam with an answer key. The exam questions are your dataset of inputs. The answer key is the expected output for each one. The grader is the teacher who marks each response right or wrong. OpenAI Evals gives you a clean way to write all three, run the exam, and get back a score — instead of pasting prompts into a chat window one by one and squinting at the results.

Why it matters

A language model is non-deterministic and sensitive to tiny changes. Reword a prompt, swap to a cheaper model, or bump a temperature setting, and the behavior can shift in ways you never intended. Without tests, you only find out when a user complains. LLM evals exist to catch those shifts before they ship, and OpenAI Evals was one of the first widely used frameworks to make that practical.

It catches regressions. Change a prompt or model and re-run the same eval. If the score drops, you broke something — and you know before it reaches production, not after.
It makes comparison fair. Running the same questions, with the same grading, across different models or prompts is the only honest way to say one is better. Eyeballing a handful of examples is not measurement.
It is a shared registry, not just a tool. The repo ships with a large library of existing evals contributed by the community. You can run a known benchmark out of the box, or copy one as a template for your own task.
It standardizes the moving parts. Instead of every team inventing its own grading scripts, an eval is a defined object: a dataset, a prompt, and a grader. That structure is what makes results reproducible and reviewable.

Who cares? Anyone shipping an LLM feature that has to keep working. Support bots, classifiers, extraction pipelines, coding assistants — if you cannot answer "did my last change make this better or worse?" with a number, you are flying blind. OpenAI Evals turns that question into a test you can run on demand.

How it works

At its core, an eval in this framework is built from a few pieces that fit together: a dataset of samples, an eval template that decides how the model is prompted and graded, and a runner that executes everything and reports a score. The model you are testing is sometimes called the completion function — anything that takes an input and returns a response.

The pieces of an eval

Samples. Each sample is one test case: an input (often a chat-style message list) and the ideal answer or grading criteria. These are usually stored as JSONL — one JSON object per line.
Eval template. Rather than writing grading logic from scratch every time, you pick a built-in template. The classic ones are Match / Includes / FuzzyMatch for checking the output against an expected string, and ModelGradedQA for letting another model judge open-ended answers.
Registry entry. A small YAML file names your eval, points at the dataset, and selects the template plus its arguments. Registering an eval is what lets you run it by name from the command line.
Runner. The oaieval command takes a model and an eval name, runs every sample, applies the grader, and writes a report with the aggregate score and a per-sample log.
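The four pieces above can be sketched in a few lines of plain Python. This is an illustrative toy, not the framework's own code: run_eval, exact_grader, and fake_model are hypothetical names, and the stand-in model simply looks answers up in a dictionary so the example runs without an API key.

```python
# Toy version of samples -> completion function -> grader -> aggregate score.
# All names here are hypothetical; none of this is the openai/evals API.

SAMPLES = [
    {"input": [{"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"},
    {"input": [{"role": "user", "content": "What is the capital of Japan?"}], "ideal": "Tokyo"},
    {"input": [{"role": "user", "content": "What is the capital of Italy?"}], "ideal": "Rome"},
]


def fake_model(messages: list[dict]) -> str:
    """Stand-in completion function: chat messages in, response text out."""
    canned = {"France": "Paris", "Japan": "Tokyo", "Italy": "Milan"}  # one wrong on purpose
    question = messages[-1]["content"]
    for country, city in canned.items():
        if country in question:
            return city
    return ""


def exact_grader(output: str, ideal: str) -> bool:
    """Deterministic grader: pass only if the trimmed output equals the ideal."""
    return output.strip() == ideal


def run_eval(samples, completion_fn, grader) -> dict:
    """Runner: prompt the model on every sample, grade it, then aggregate."""
    log = []
    for sample in samples:
        output = completion_fn(sample["input"])
        log.append({
            "ideal": sample["ideal"],
            "output": output,
            "passed": grader(output, sample["ideal"]),
        })
    accuracy = sum(entry["passed"] for entry in log) / len(log)
    return {"accuracy": accuracy, "log": log}


report = run_eval(SAMPLES, fake_model, exact_grader)
print(f"accuracy: {report['accuracy']:.2f}")
for entry in report["log"]:
    print(entry)
```

Swapping fake_model for a real model call, or exact_grader for a looser check, changes one argument and leaves the rest of the run untouched. That separation is the point of the structure.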

// How a single eval run flows

Dataset (JSONL of test cases) → Prompt model (completion function) → Grade output (template: match or model-graded) → Aggregate (score + per-sample log)

Two ways to grade

The framework draws a clear line between two grading styles, and choosing the right one is most of the skill in writing a good eval — the same split covered in code-graded vs model-graded evals.

// Deterministic vs model-graded

Deterministic (Match)

Compares output to an exact / fuzzy expected string
Cheap, fast, fully repeatable
Great for classification, extraction, exact answers
Brittle when many phrasings are correct

Model-graded (ModelGradedQA)

A second model judges the answer against criteria
Handles open-ended, free-form responses
Flexible, but slower and costs tokens
Inherits the grader model's own biases

For a task like "classify this ticket as billing, bug, or other," a deterministic Includes check is perfect. For "summarize this document faithfully," there is no single correct string, so you reach for model-graded evaluation — the framework's built-in version of LLM-as-a-judge.
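To make the trade-off concrete, here is a minimal sketch of both grading styles. The function names echo the templates mentioned above, but the bodies are simplified stand-ins written for this article, not the library's implementations. How strict each check is, the judge prompt, and stub_judge are all assumptions made for illustration.

```python
import string
from typing import Callable


def match(output: str, ideal: str) -> bool:
    """Strictest check (assumed here): the output must begin with the ideal string."""
    return output.strip().startswith(ideal)


def includes(output: str, ideal: str) -> bool:
    """Looser check: the ideal string appears anywhere in the output."""
    return ideal in output


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def fuzzy_match(output: str, ideal: str) -> bool:
    """Loosest deterministic check: after normalizing, one string contains the other."""
    a, b = _normalize(output), _normalize(ideal)
    return bool(a) and bool(b) and (a in b or b in a)


def model_graded(output: str, criteria: str, judge: Callable[[str], str]) -> bool:
    """Model-graded check: ask a judge (any prompt -> text callable) for a Y/N verdict."""
    prompt = f"Criteria: {criteria}\nAnswer: {output}\nReply with only Y or N."
    return judge(prompt).strip().upper().startswith("Y")


def stub_judge(prompt: str) -> str:
    """Hypothetical offline judge; in practice this would be a call to a second model."""
    answer = prompt.split("Answer:", 1)[1]
    return "Y" if "refund" in answer.lower() else "N"


answer = "The capital is Paris."
print(match(answer, "Paris"), includes(answer, "Paris"), fuzzy_match(answer, "Paris"))
print(model_graded("You will get a refund within 5 days.",
                   "Mentions the refund timeline", stub_judge))
```

The deterministic checks are pure functions, so the same output always gets the same verdict. The model-graded check is only as reliable as the judge you pass in.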

A worked example

Suppose you want to test whether a model correctly answers a few basic geography questions. First you write the dataset as JSONL — one sample per line, each with the input messages and the ideal answer the grader will match against.

capitals.jsonl (one object per line) · json

{"input": [{"role": "system", "content": "Answer with only the city name."}, {"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"input": [{"role": "system", "content": "Answer with only the city name."}, {"role": "user", "content": "What is the capital of Japan?"}], "ideal": "Tokyo"}
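A malformed line in the dataset tends to surface as a confusing failure mid-run, so it can help to check the file before running anything. The sketch below is a hypothetical pre-flight check, not part of the framework: parse_samples is an invented helper, and it assumes every sample follows the shape shown above (a list of role/content messages plus a single string in "ideal").

```python
import json

# The two samples from above, inlined so the example is self-contained.
RAW = '''{"input": [{"role": "system", "content": "Answer with only the city name."}, {"role": "user", "content": "What is the capital of France?"}], "ideal": "Paris"}
{"input": [{"role": "system", "content": "Answer with only the city name."}, {"role": "user", "content": "What is the capital of Japan?"}], "ideal": "Tokyo"}'''


def parse_samples(text: str) -> list[dict]:
    """Parse JSONL text and check each sample against the assumed shape."""
    samples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: not valid JSON ({exc.msg})") from exc
        if not isinstance(sample, dict):
            raise ValueError(f"line {line_no}: each line must be a JSON object")
        messages = sample.get("input")
        if not isinstance(messages, list) or not messages:
            raise ValueError(f"line {line_no}: 'input' must be a non-empty list")
        for message in messages:
            if not isinstance(message, dict) or not {"role", "content"} <= set(message):
                raise ValueError(f"line {line_no}: each message needs 'role' and 'content'")
        if not isinstance(sample.get("ideal"), str):
            raise ValueError(f"line {line_no}: 'ideal' must be a string")
        samples.append(sample)
    return samples


samples = parse_samples(RAW)
print(f"{len(samples)} valid samples:", [s["ideal"] for s in samples])
```

For a real file, you would pass the contents of capitals.jsonl to parse_samples instead of the inlined RAW string.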

Next you register the eval in a small YAML file. It gives the eval a name, picks the Match template (exact-string grading), and points at the dataset file.

registry entry for the eval · yaml

capitals:
  id: capitals.dev.v0
  metrics: [accuracy]
capitals.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: capitals/capitals.jsonl

Now you run it from the command line, naming the model and the eval. The runner sends each input to the model, compares the response to ideal, and prints an accuracy score plus a path to the detailed log.

run the eval · bash

oaieval gpt-4 capitals

The open-source repo vs the hosted Evals API

This is the single most common point of confusion, so it is worth being precise. Both let you evaluate models, but they are different products you use in different ways.

	Open-source Evals (this article)	Hosted Evals API
What it is	A Python framework + public benchmark registry on GitHub	A managed evaluation service inside the OpenAI platform
Where it runs	On your own machine / CI, via the `oaieval` command	On OpenAI's servers, via API calls and the dashboard
Setup	Clone the repo, install, write JSONL + YAML	Define datasets and graders through the API or web UI
Best for	Custom logic, the shared registry, full local control	Quick managed runs tied to your production traffic and logs

A simple way to hold the two apart: the open-source repo is a toolkit and a shared library of benchmarks you operate yourself; the hosted Evals API is a service that runs evaluations for you on the platform. They are complementary, not competitors. If a tutorial says "clone the repo and run oaieval," it means the open-source framework. If it says "create an eval through the API or the dashboard," it means the hosted product.

Common pitfalls

An eval framework only helps if the eval itself is good. Most disappointing results trace back to the dataset or the grader, not the runner.

Too few samples. Five questions tell you almost nothing; one lucky run looks great and one unlucky run looks broken. You need enough cases to trust the number — see how many eval samples you need.
Brittle exact matching. Match fails if the model says "The capital is Paris." instead of "Paris". Use a system prompt that constrains the format, or switch to Includes / FuzzyMatch so reasonable phrasings still pass.
Blindly trusting the model grader. ModelGradedQA is convenient but the judge has its own blind spots and can be inconsistent. Spot-check its verdicts against your own judgment before you rely on the score.
A leaky or unrepresentative dataset. If your test cases have leaked into the prompt's few-shot examples or the model's training data, or if they do not look like real user inputs, a high score is meaningless. Build a golden dataset from real, varied examples.
Running it once and forgetting. The whole point is repetition. An eval that you do not re-run on every prompt or model change is not protecting you — wire it into a CI pipeline.
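Two of these pitfalls, too few samples and running it once, come down to arithmetic. The sketch below uses the standard normal approximation for a proportion to show how wide the uncertainty on an accuracy score is at different sample counts, then adds a simple pass/fail gate of the kind a CI job might apply. The function names, the baseline, and the 0.02 tolerance are hypothetical choices, and the approximation is rough for very small sample counts.

```python
import math


def accuracy_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for an accuracy score (normal approximation)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def gate(baseline: float, candidate: float, max_drop: float = 0.02) -> bool:
    """Regression gate: pass unless the score fell by more than max_drop."""
    return candidate >= baseline - max_drop


# The same 80% accuracy is far less trustworthy on 5 samples than on 500.
for correct, total in [(4, 5), (400, 500)]:
    low, high = accuracy_interval(correct, total)
    print(f"{correct}/{total}: accuracy {correct / total:.2f}, roughly {low:.2f} to {high:.2f}")

# In CI, a failed gate would translate into a non-zero exit code.
print("small dip passes:", gate(baseline=0.80, candidate=0.79))
print("large drop passes:", gate(baseline=0.80, candidate=0.70))
```

A tolerance narrower than the interval is mostly measuring noise, so the sample count and the gate threshold need to be chosen together.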

Going deeper

Once the basics click, the framework has more depth than the simple Match example suggests, and the wider ecosystem around it is worth knowing.

Completion functions. You are not limited to grading a single model call. A completion function can wrap any logic — a multi-step chain, a tool-using agent, a RAG pipeline — as long as it takes an input and returns a response. That means you can run the exact same eval against your whole system, not just the raw model, which is usually what you actually care about.
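As a rough illustration, the sketch below wraps a toy two-step pipeline behind a single callable. The shape (a callable that returns a result object exposing get_completions()) follows the repo's completion-function documentation as best understood here, so check the current docs before relying on it. The class names and the dictionary lookup that stands in for retrieval are hypothetical, and nothing is imported from the library.

```python
class PipelineResult:
    """Result wrapper exposing completions as a list of strings (assumed shape)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


class RetrieveThenAnswer:
    """Toy two-step system: look up a fact, then turn it into an answer."""

    def __init__(self, knowledge: dict[str, str]) -> None:
        self.knowledge = knowledge

    def __call__(self, prompt, **kwargs) -> PipelineResult:
        # Accept either a plain string or a chat-style message list.
        question = prompt if isinstance(prompt, str) else prompt[-1]["content"]
        # Step 1: "retrieval" (a dictionary lookup stands in for a real retriever).
        fact = next(
            (value for key, value in self.knowledge.items() if key.lower() in question.lower()),
            None,
        )
        # Step 2: "generation" (a real system would call a model with the fact as context).
        return PipelineResult(fact if fact is not None else "I don't know")


system = RetrieveThenAnswer({"France": "Paris", "Japan": "Tokyo"})
messages = [{"role": "user", "content": "What is the capital of Japan?"}]
print(system(messages).get_completions())
```

From the grader's point of view this object is indistinguishable from a single model call, which is why the same dataset and grader can score the whole pipeline.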

Custom eval classes. The built-in templates cover common cases, but you can subclass the base Eval to write arbitrary Python grading: parse structured output, run unit tests against generated code, call an external checker, or combine several metrics. This is where the framework stops being a string-matcher and becomes a general test harness.
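The kind of grading logic a custom eval might contain can be sketched without the framework at all. The example below parses structured JSON output and reports two metrics instead of one. In the framework this logic would live inside a custom eval class, but the sketch deliberately leaves the base class out. grade_ticket, aggregate, and the two-field ticket schema are hypothetical.

```python
import json

REQUIRED_FIELDS = ("category", "priority")  # hypothetical output schema


def grade_ticket(output: str, ideal: dict) -> dict:
    """Grade one structured response: is it valid JSON, and which fields are right?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"valid_json": False, "field_accuracy": 0.0}
    if not isinstance(parsed, dict):
        return {"valid_json": False, "field_accuracy": 0.0}
    hits = sum(parsed.get(field) == ideal[field] for field in REQUIRED_FIELDS)
    return {"valid_json": True, "field_accuracy": hits / len(REQUIRED_FIELDS)}


def aggregate(results: list[dict]) -> dict:
    """Combine per-sample grades into the metrics reported for the run."""
    n = len(results)
    return {
        "valid_json_rate": sum(r["valid_json"] for r in results) / n,
        "mean_field_accuracy": sum(r["field_accuracy"] for r in results) / n,
    }


outputs = [
    '{"category": "billing", "priority": "high"}',  # fully correct
    '{"category": "bug", "priority": "low"}',       # wrong priority
    "billing, high",                                # not JSON at all
]
ideals = [
    {"category": "billing", "priority": "high"},
    {"category": "bug", "priority": "high"},
    {"category": "billing", "priority": "high"},
]

results = [grade_ticket(out, ideal) for out, ideal in zip(outputs, ideals)]
summary = aggregate(results)
print(results)
print(summary)
```

Reporting the JSON validity rate separately from field accuracy tells you whether a low score comes from formatting failures or from wrong answers, which a single string match cannot.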

The registry as a benchmark library. Beyond writing your own, the registry holds many community-contributed evals. Reading them is one of the best ways to learn what a well-built eval looks like, and you can run an existing benchmark to sanity-check a new model before investing in custom tests.

Where to go next. OpenAI Evals is one of several tools in this space. If you prefer a unit-test feel, look at pytest-style frameworks; if you want config-driven prompt comparison and red-teaming, look at CLI-based eval tools; for RAG-specific metrics, there are dedicated libraries. The concepts you learned here — dataset, prompt, grader, score — carry over to all of them. The durable lesson is that how you grade matters more than which framework you pick: a thoughtful dataset and an honest grader beat a fancy tool wrapped around a sloppy test set every time.

FAQ

What is OpenAI Evals?

OpenAI Evals is an open-source framework and shared benchmark registry for testing the behavior of language models. You define a dataset of inputs, the expected outputs, and a grading method, then run the eval to get a repeatable score. It lets you check whether a prompt or model change improves or breaks your results.

What is the difference between OpenAI Evals and the Evals API?

They are two separate things with a shared name. OpenAI Evals is the open-source repo (openai/evals) that you clone and run yourself with the oaieval command. The Evals API is a hosted service inside the OpenAI platform that runs evaluations for you on its servers, often tied to your production logs. The open-source repo is a self-operated toolkit; the API is a managed service.

What is a model-graded eval in OpenAI Evals?

A model-graded eval uses a second language model to judge an answer instead of matching it against a fixed string. The built-in ModelGradedQA template is the standard example. It is the right choice for open-ended tasks like summarization where there is no single correct output, but it is slower, costs tokens, and inherits the grader model's own biases.

Can OpenAI Evals only test OpenAI models?

No. Although it was created by OpenAI, the framework grades any completion function — including other providers' models or your own multi-step systems — as long as it takes an input and returns a response. The grading templates and registry are independent of which model produces the answer.

How do I run an OpenAI eval?

You write your test cases as a JSONL dataset, register the eval in a small YAML file that names the dataset and a grading template, then run oaieval <model> <eval-name> from the command line. The runner sends each input to the model, grades the response, and prints an aggregate score plus a per-sample log.
