What Is DeepEval? Pytest-Style LLM Evaluation

You will understand how DeepEval frames LLM evaluation as unit tests and which metrics it provides out of the box.

Intermediate · 11 min read · Updated 2026-06-14

confident-ai/deepeval · 16.3k · Docs: deepeval.com

In plain English

When you build normal software, you protect it with unit tests: small checks that say "given this input, the output should equal that." Run them on every change, and a red light tells you the moment something breaks. The trouble with large language models is that they don't return one fixed answer. Ask the same question twice and the wording shifts, so `assert output == "yes"` is useless. You can't pin a model down to an exact string.

DeepEval is an open-source framework that brings the unit-test feeling back to LLM work. People call it "Pytest for LLMs" because it looks and runs almost exactly like Pytest — you write test files, run them from the command line, and get a pass/fail report. The difference is how it decides pass or fail. Instead of comparing strings character by character, each DeepEval test attaches one or more metrics that score the output on meaning: is it faithful to the source, is it relevant to the question, did it hallucinate?

Think of a school exam. A multiple-choice test has one correct bubble — that's a traditional unit test. An essay question has no single right answer, so a teacher reads it against a rubric: does it answer the prompt, is it accurate, does it stay on topic? DeepEval is that teacher. It hands the model's answer to a rubric (often graded by another LLM) and turns the judgement into a number you can assert against, like the relevancy score must be at least 0.7.
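The contrast between an exact-match assertion and a scored assertion can be sketched in a few lines of plain Python. This is only an illustration: `token_overlap` is a hypothetical stand-in scorer invented for this example, not anything DeepEval provides, and counting shared words cannot judge meaning the way an LLM-graded metric does.

```python
import re


def token_overlap(answer: str, reference: str) -> float:
    """Crude 0-1 score: share of reference words that also appear in the answer."""
    answer_words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    reference_words = set(re.findall(r"[a-z0-9]+", reference.lower()))
    if not reference_words:
        return 0.0
    return len(answer_words & reference_words) / len(reference_words)


reference = "Refunds are accepted within 30 days."
answer = "Refunds are accepted within 30 days of purchase."

# Traditional unit-test style: any rewording breaks the check.
exact_match = answer == reference

# Rubric style: turn the judgement into a number and assert on a threshold.
score = token_overlap(answer, reference)
passed = score >= 0.7
```

The exact comparison fails on a harmless rewording, while the scored check passes. A real metric swaps the word-counting function for a judge that reads the answer against a rubric.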

Why it matters

Most LLM features start life judged "by vibes." Someone tweaks a prompt, eyeballs three or four answers, decides it looks better, and ships. That works until the app grows. Now a prompt change that fixes one case quietly breaks five others, and nobody notices until a user complains. Vibe-checking does not scale and it does not catch regressions — the silent backslide where today's edit undoes last week's fix.

DeepEval matters because it makes LLM quality measurable and repeatable. Once your behaviour lives in test cases with metrics, you get the same safety net software engineers have relied on for decades:

Catch regressions automatically. Change a prompt, model, or retrieval step, re-run the suite, and see exactly which cases got worse — before your users do.
Compare options objectively. Two prompts or two models go head-to-head on the same metrics, so "which is better?" becomes a number instead of an argument.
Gate deployments in CI. Wire the suite into your pipeline and a pull request that drops faithfulness below your threshold simply fails the build, the same way a failing unit test blocks a merge.
Make quality a team artifact. A shared test suite is documentation of what "good" means for your app, instead of living only in one person's head.

Who should care? Anyone shipping an LLM feature past the demo stage — chatbots, RAG pipelines, summarizers, agents. If your app's output reaches real users and you can't currently answer "did my last change make it better or worse?" with evidence, that gap is exactly what DeepEval fills. It sits in the broader world of LLM evaluation; its particular flavour is the developer-friendly, test-first one.

How it works

DeepEval is built from two simple objects: the test case (what happened) and the metric (how to score it). You package your data into test cases, attach metrics, and run them — exactly like assembling assertions in Pytest.

The test case: four standard fields

A DeepEval test case is a small record describing one interaction. The same handful of fields cover almost every LLM app, which is what keeps metrics reusable across projects:

Field	What it holds	Example
`input`	What you sent the model	"What's the refund window?"
`actual_output`	What the model actually said	"You have 30 days to return items."
`expected_output`	The ideal answer (optional)	"Refunds are accepted within 30 days."
`retrieval_context`	Chunks a RAG step fetched (optional)	["Refunds accepted within 30 days..."]

Not every metric needs every field. A relevancy check only looks at `input` and `actual_output`. A faithfulness check for RAG also reads `retrieval_context` to see whether the answer stuck to the retrieved text. A metric that needs a gold answer reads `expected_output`. You fill in the fields a metric requires and leave the rest empty.
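The idea that each metric reads only some fields can be sketched with a small stand-alone model. The `SketchCase` class, the `REQUIRED_FIELDS` mapping, and the metric labels below are all hypothetical names for illustration. They mirror the four fields in the table but are not DeepEval's own classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchCase:
    """Stand-in for a test case with the four standard fields."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    retrieval_context: Optional[List[str]] = None


# Hypothetical mapping: which fields each kind of check reads.
REQUIRED_FIELDS = {
    "relevancy": ("input", "actual_output"),
    "faithfulness": ("input", "actual_output", "retrieval_context"),
    "gold_answer": ("input", "actual_output", "expected_output"),
}


def missing_fields(case: SketchCase, metric_name: str) -> List[str]:
    """Return the required fields that are still empty for this metric."""
    return [f for f in REQUIRED_FIELDS[metric_name] if getattr(case, f) is None]


case = SketchCase(
    input="What's the refund window?",
    actual_output="You have 30 days to return items.",
)
relevancy_gaps = missing_fields(case, "relevancy")
faithfulness_gaps = missing_fields(case, "faithfulness")
```

Here the case is complete for a relevancy check but would need `retrieval_context` filled in before a faithfulness check could run.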

The metric: a scorer with a threshold

A metric reads the relevant fields, produces a score between 0 and 1, and compares it to a threshold you set. A score at or above the threshold passes; anything below fails. Many DeepEval metrics are LLM-graded: under the hood they ask an evaluation model to judge the output against a rubric and explain its reasoning. A few are computed with plain code. In both cases, a test case must pass every attached metric, or the test goes red.
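The pass/fail rule described above can be written out as a minimal sketch. `SketchMetric` and `case_passes` are hypothetical names for this example, and the fixed-value scorers stand in for what would normally be an LLM judge or a code-based check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SketchMetric:
    name: str
    threshold: float
    scorer: Callable[[Dict[str, str]], float]  # must return a score in [0, 1]

    def passes(self, case: Dict[str, str]) -> bool:
        # A score at or above the threshold passes; anything below fails.
        return self.scorer(case) >= self.threshold


def case_passes(case: Dict[str, str], metrics: List[SketchMetric]) -> bool:
    """A test case is green only if every attached metric passes."""
    return all(metric.passes(case) for metric in metrics)


case = {
    "input": "What's the refund window?",
    "actual_output": "You have 30 days to return items.",
}

# Fixed scores stand in for real grading so the sketch runs offline.
relevancy = SketchMetric("relevancy", threshold=0.7, scorer=lambda c: 0.82)
faithfulness = SketchMetric("faithfulness", threshold=0.7, scorer=lambda c: 0.65)

relevancy_only = case_passes(case, [relevancy])
both_metrics = case_passes(case, [relevancy, faithfulness])
```

With only the relevancy metric attached the case passes. Adding the faithfulness metric, which scores below its threshold, turns the whole case red.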

One DeepEval test, end to end:

1. Build test case: input + actual_output (+ context)
2. Attach metric: e.g. AnswerRelevancy, threshold 0.7
3. Score: metric returns 0–1 + reason
4. Assert: score ≥ threshold?
5. Pass / fail: report like Pytest

Here is the same idea in code. If you've written a Pytest test, this will look instantly familiar — the only new piece is assert_test, which runs the metrics and fails the test if any score falls short.

test_refund_bot.py (python)

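from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def my_app(question: str) -> str:
    # Placeholder: replace with a call to your own LLM application.
    return "You have 30 days to return physical items."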

def test_refund_answer():
    # 1) Capture what happened in one test case.
    test_case = LLMTestCase(
        input="What's the refund window for physical items?",
        actual_output=my_app("What's the refund window for physical items?"),
    )

    # 2) Define a metric: it must score >= 0.7 to pass.
    relevancy = AnswerRelevancyMetric(threshold=0.7)

    # 3) Run it. Like a Pytest assert, this fails the test if the
    #    score is below the threshold.
    assert_test(test_case, [relevancy])

You run this with the DeepEval CLI, which wraps Pytest:

bash

deepeval test run test_refund_bot.py

The built-in metrics worth knowing

DeepEval ships a library of ready-made metrics so you don't write graders from scratch. You rarely use them all — pick the two or three that match your app. The headline ones beginners reach for most:

Metric	Question it answers	Typical use
G-Eval	Does it meet my custom criteria, in plain English?	Anything with a bespoke rubric
Answer Relevancy	Does the answer actually address the question?	Chatbots, Q&A
Faithfulness	Does the answer stick to the retrieved context?	RAG pipelines
Hallucination	Did it state things the context doesn't support?	Grounded / factual apps
Contextual Relevancy	Did retrieval fetch on-topic chunks?	Debugging the retriever in RAG

G-Eval deserves a special mention because it's the flexible one. Instead of a fixed rule, you describe your evaluation criteria in ordinary language ("check whether the response is polite and never reveals internal pricing") and G-Eval has an LLM build step-by-step reasoning and produce a calibrated score. It's the escape hatch for quality dimensions that no off-the-shelf metric captures. (G-Eval comes from a research method that DeepEval popularized; you can read more in G-Eval explained.)

A custom G-Eval metric (python)

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

politeness = GEval(
    name="Politeness",
    criteria="Determine whether the response stays polite and "
             "professional, even if the user is rude.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8,
)

DeepEval vs Ragas: which test-first tool?

The other open-source eval library beginners meet is Ragas, and they overlap enough to cause confusion. The short version: Ragas is laser-focused on scoring RAG quality, while DeepEval is a broader, developer-workflow framework that also covers RAG. They are not rivals so much as different shapes.

Two popular open-source eval libraries

DeepEval	Ragas
Pytest-style test cases + assertions	Dataset-and-metrics, notebook-friendly
Broad metric set (incl. G-Eval, agents)	Deep, RAG-specific metric coverage
CLI + CI integration built in	Pairs naturally with data pipelines
Great when evals are part of your dev loop	Great when RAG quality is the whole question

A reasonable rule of thumb: if you want evals to feel like running your test suite and to live inside CI alongside your code, DeepEval's test-first shape fits naturally. If your job is squarely "how good is my retrieval pipeline?" and you live in notebooks and dataframes, Ragas is purpose-built for that. Plenty of teams use both, and the underlying ideas — faithfulness, relevancy, code-graded versus model-graded checks — carry over between them.

Putting it in CI

The payoff of a test-first framework is automation. Because deepeval test run behaves like Pytest, dropping it into a pipeline is a couple of lines. A failed metric becomes a failed build, and a pull request that regresses quality is blocked the same way a broken unit test blocks a merge.

.github/workflows/evals.yml (sketch, yaml)

name: LLM evals
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install deepeval
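      # The eval model key comes from repository secrets. OPENAI_API_KEY is
      # an assumption here; use whichever key your judge model needs.
      - run: deepeval test run tests/
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}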

Two cautions make this practical. First, LLM-graded metrics cost money and time on every run, so a giant suite on every commit gets slow and expensive — most teams run a small, fast "smoke" set on each pull request and a larger nightly suite. Second, scores from a model judge wobble a little run to run, so set thresholds with a margin and don't treat a 0.01 dip as a real regression. For the wider picture, see building an LLM eval suite and wiring evals into CI/CD.
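The second caution, leaving a margin for judge noise, can be sketched as a small comparison helper. The function name `is_regression` and the default tolerance of 0.05 are assumptions for illustration; the right margin depends on how much your own judge model varies between runs.

```python
from statistics import mean
from typing import Sequence


def is_regression(
    baseline_scores: Sequence[float],
    current_scores: Sequence[float],
    tolerance: float = 0.05,  # assumed margin; tune it to your judge's noise
) -> bool:
    """Flag a regression only when the average drop exceeds the tolerance."""
    drop = mean(baseline_scores) - mean(current_scores)
    return drop > tolerance


baseline = [0.82, 0.80, 0.81]      # scores from repeated runs on main
small_wobble = [0.80, 0.81, 0.79]  # about a 0.01 dip: treated as noise
real_drop = [0.70, 0.72, 0.71]     # about a 0.10 dip: treated as a regression

wobble_flagged = is_regression(baseline, small_wobble)
drop_flagged = is_regression(baseline, real_drop)
```

Averaging a few repeated runs before comparing is one simple way to keep a single noisy judgement from blocking a merge.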

Going deeper

Once the test-case-plus-metric pattern clicks, DeepEval opens up in a few directions worth knowing as your needs grow.

Bulk evaluation and datasets. Asserting one case at a time is fine for spot checks, but real evaluation runs a dataset of many cases and reports aggregate scores. DeepEval can loop a metric over a whole collection of test cases so you measure average faithfulness across hundreds of examples, not just one. This is where eval stops being a unit test and starts being a benchmark of your app — and where sample size starts to matter for trusting the numbers.
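As a rough sketch of what aggregate reporting involves, the helper below summarizes a list of per-case scores. It assumes the scores have already been produced by some metric; `summarize` and its output keys are hypothetical names, not DeepEval's reporting format.

```python
from statistics import mean
from typing import Dict, List, Union


def summarize(scores: List[float], threshold: float) -> Dict[str, Union[float, List[int]]]:
    """Aggregate per-case scores into a mean, a pass rate, and failing indices."""
    if not scores:
        raise ValueError("scores must contain at least one value")
    failed = [i for i, s in enumerate(scores) if s < threshold]
    return {
        "mean": mean(scores),
        "pass_rate": (len(scores) - len(failed)) / len(scores),
        "failed": failed,
    }


# Hypothetical faithfulness scores for four test cases.
report = summarize([0.9, 0.8, 0.6, 0.7], threshold=0.7)
```

The list of failing indices matters as much as the average: it points at the specific cases to inspect. With only four cases, one changed score moves the pass rate by 25 points, which is why sample size affects how far the numbers can be trusted.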

Agent and multi-step metrics. Beyond single question-answer pairs, DeepEval includes metrics aimed at agents and tool use — checking whether the right tool was called or whether a multi-turn task actually completed. These reflect that modern LLM apps are pipelines, not one prompt, so a single relevancy score can't tell the whole story.

Custom metrics. When neither a built-in metric nor G-Eval fits, you can write your own metric class with your own scoring logic — code-graded, model-graded, or a blend. This keeps the framework open-ended: anything you can express as "read these fields, return a score and a threshold" becomes a first-class DeepEval metric.
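The "read these fields, return a score and a threshold" shape can be sketched with a code-graded example. This is a plain Python class written for illustration: `KeywordCoverageMetric` and its method names are assumptions of this sketch, and it does not subclass DeepEval's real base class, so check the DeepEval custom-metric documentation for the exact interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KeywordCoverageMetric:
    """Scores an answer by the share of required keywords it mentions."""
    required_keywords: List[str]
    threshold: float = 0.5
    score: float = field(default=0.0, init=False)

    def measure(self, actual_output: str) -> float:
        text = actual_output.lower()
        hits = sum(1 for kw in self.required_keywords if kw.lower() in text)
        # Guard against an empty keyword list to avoid dividing by zero.
        self.score = hits / len(self.required_keywords) if self.required_keywords else 0.0
        return self.score

    def is_successful(self) -> bool:
        return self.score >= self.threshold


strict = KeywordCoverageMetric(["30 days", "refund"], threshold=1.0)
strict_score = strict.measure("You have 30 days to return items.")
strict_ok = strict.is_successful()

lenient = KeywordCoverageMetric(["30 days", "refund"], threshold=0.5)
lenient.measure("You have 30 days to return items.")
lenient_ok = lenient.is_successful()
```

The same answer fails the strict version and passes the lenient one, so with a custom metric the threshold is as much a design decision as the scoring logic.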

The honest limits. DeepEval makes quality measurable, not objective. Most of its headline metrics lean on an LLM judge, so they inherit that judge's blind spots and a touch of randomness — your eval is only as trustworthy as the grader and the dataset behind it. Treat scores as a strong signal that flags regressions and ranks options, not as ground truth. The durable lesson is the same one that holds for all LLM evaluation: a good test suite tells you where to look, and a human still decides what "good enough" means.

FAQ

What is DeepEval used for?

DeepEval is an open-source framework for testing and evaluating LLM applications the way you write unit tests. You package each interaction into a test case (input, actual output, optional expected output and retrieval context) and attach metrics like answer relevancy, faithfulness, or hallucination. Each metric scores the output and passes or fails against a threshold, so you can catch quality regressions automatically in CI.

Why is DeepEval called 'Pytest for LLMs'?

Because it deliberately mirrors Pytest. You write test files, run them from the command line with deepeval test run, and get a familiar pass/fail report. The difference is that instead of comparing exact strings, each test asserts on a metric score that judges the output's meaning, which is what LLM outputs need since their wording changes every time.

What's the difference between DeepEval and Ragas?

Ragas is focused specifically on scoring RAG pipeline quality and fits naturally into notebooks and data pipelines. DeepEval is a broader, developer-workflow framework with a Pytest-style test-case model and built-in CI integration that also covers RAG. Use DeepEval when you want evals to run like your test suite; use Ragas when RAG retrieval quality is the whole question. Many teams use both.

What is G-Eval in DeepEval?

G-Eval is DeepEval's flexible, custom metric. You describe your evaluation criteria in plain English, and an LLM judge generates step-by-step reasoning and a calibrated score against that rubric. It's the go-to when no off-the-shelf metric captures the quality dimension you care about, such as tone, policy compliance, or domain-specific correctness.

Does DeepEval need an API key?

For purely code-based metrics, no. But many of DeepEval's headline metrics — including G-Eval, faithfulness, and answer relevancy — are graded by an LLM, so those need access to an evaluation model. You configure a judge model (for example a Claude or GPT model) once, and pick a capable one, since a weak grader produces noisy scores.

Can I run DeepEval in CI/CD?

Yes — that's a core use case. Because deepeval test run behaves like Pytest, you add it to a pipeline in a couple of lines, and a metric falling below its threshold fails the build like a broken unit test. In practice teams run a small, fast eval set on each pull request and a larger suite nightly, since LLM-graded metrics cost time and money on every run.
