AI/TLDR

DeepEval

Unit-test your LLM apps with 50+ evaluation metrics

Overview

DeepEval is an open-source Python framework for evaluating large-language-model systems. It works much like Pytest, but is built for unit-testing LLM apps: you write test cases, attach metrics, and assert that your model's output meets a quality threshold.

It ships with a large set of ready-to-use metrics, including G-Eval, answer relevancy, faithfulness, hallucination, and task completion. The metrics are computed on your machine using LLM-as-a-judge and other NLP models; judge-based metrics call whichever LLM you configure as the judge. Whether you are building RAG pipelines, agents, or chatbots, with LangChain or the OpenAI SDK, you can score the parts that matter for each.

It fits in the evaluation and testing layer of an AI stack. Because tests run through the deepeval CLI, you can wire them into CI/CD to catch quality regressions, compare prompts and models, and decide whether a change is safe to ship.

What it does

  • 50+ ready-to-use metrics with explanations, including G-Eval, DAG, answer relevancy, faithfulness, hallucination, and task completion
  • Pytest-style API: write test cases with LLMTestCase and assert them against metrics
  • Runs metrics locally using LLM-as-a-judge and NLP models, with any LLM of your choice as the judge
  • Dedicated metric groups for RAG, agentic, and multi-turn (chatbot) use cases
  • CI/CD friendly via the deepeval test run CLI command
  • Works with apps built on LangChain or OpenAI, so you can compare prompts, models, and architectures

Getting started

Install DeepEval, set your judge model's API key, write a test case, and run it with the deepeval CLI.

Install DeepEval

Install the package from PyPI.

bash
pip install -U deepeval

Set your judge model key

Many metrics use an LLM as a judge. Set the OpenAI API key (or configure another model) before running.

bash
export OPENAI_API_KEY="..."
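
A missing key may only surface once a metric first calls the judge, so it can help to fail fast before the suite starts. The following is a minimal sketch using only the standard library; `require_judge_key` is a hypothetical helper name, not part of DeepEval, and it assumes the default `OPENAI_API_KEY` variable shown above.

```python
import os


def require_judge_key(env=None, name="OPENAI_API_KEY"):
    """Return the judge model's API key, or raise if it is unset or blank.

    Hypothetical helper for illustration; not a DeepEval function.
    """
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the tests")
    return key
```

Calling `require_judge_key()` at the top of a test file would stop the run with a clear message instead of a failure deep inside a metric call.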

Write a test case

Create a file (for example test_chatbot.py) that defines a metric and an LLMTestCase, then pass both to assert_test.

python
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, SingleTurnParams


def test_correctness():
    # G-Eval metric: the judge model scores the output against the criteria below.
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.EXPECTED_OUTPUT],
        threshold=0.5,  # minimum score needed to pass
    )
    # One interaction: the user's input, the app's answer, and the reference answer.
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."],
    )
    # Fails the test if the metric does not meet its threshold.
    assert_test(test_case, [correctness_metric])
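
The `threshold=0.5` argument above sets the pass mark for the metric. As a rough mental model, and an assumption about the general pattern rather than DeepEval's actual implementation, each metric yields a score between 0 and 1 and the test fails when any attached metric scores below its threshold. `MetricResult` and `failing_metrics` are hypothetical names used only for illustration, and the scores are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical container for one metric's outcome (not a DeepEval class)."""
    name: str
    score: float      # assumed to lie in [0, 1], higher is better
    threshold: float  # minimum score required to pass

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def failing_metrics(results):
    """Return the names of metrics whose score is below their threshold."""
    return [r.name for r in results if not r.passed]


# Illustrative scores only; real scores come from the judge model.
results = [
    MetricResult("Correctness", score=0.8, threshold=0.5),
    MetricResult("Answer Relevancy", score=0.4, threshold=0.7),
]
print(failing_metrics(results))  # ['Answer Relevancy']
```

A metric where lower scores are better would invert the comparison, so check each metric's documentation before choosing a threshold.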

Run the tests

Use the deepeval CLI to run your test file.

bash
deepeval test run test_chatbot.py
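
In a CI job, the same command can be wrapped in a small script so the build fails when the evaluation does. This sketch assumes that `deepeval test run` exits with a non-zero status when a test fails, as Pytest-style runners usually do; verify that against the version you install. `build_command` and `run_gate` are hypothetical names, not part of DeepEval.

```python
import subprocess


def build_command(test_file):
    """Build the argument list for the CLI call shown above."""
    return ["deepeval", "test", "run", test_file]


def run_gate(test_file, runner=subprocess.run):
    """Run the test file through the CLI and return its exit code.

    `runner` is injectable so the gate can be exercised without the CLI installed.
    """
    completed = runner(build_command(test_file))
    return completed.returncode
```

A CI step could then end with `sys.exit(run_gate("test_chatbot.py"))` so that the exit code propagates to the build.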

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

  • Add quality checks to CI/CD so prompt or model changes that lower output quality fail the build
  • Evaluate a RAG pipeline for answer relevancy, faithfulness, and contextual recall
  • Score agent behavior such as task completion and tool correctness
  • Compare prompts and models (for example moving from OpenAI to Claude) before shipping a change
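
For the comparison use case, one simple approach is to score the same test inputs under both variants and block the change if the average drops by more than a chosen tolerance. The sketch below shows only that decision rule and is not a DeepEval feature; `is_regression`, the tolerance value, and the scores are all illustrative placeholders.

```python
from statistics import mean


def is_regression(baseline_scores, candidate_scores, tolerance=0.05):
    """Return True if the candidate's mean score falls more than `tolerance`
    below the baseline's mean score."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(candidate_scores) < mean(baseline_scores) - tolerance


# Placeholder scores for illustration; in practice these would be the
# per-test-case metric scores collected from two evaluation runs.
baseline = [0.82, 0.75, 0.90]
candidate = [0.70, 0.68, 0.74]
print(is_regression(baseline, candidate))  # True
```

Comparing means is the simplest possible rule; with few test cases a single outlier can dominate, so a per-case comparison may be more informative.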

How DeepEval compares

DeepEval alongside the other open-source evaluation and red-teaming tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Strix | ★ 26.1k | Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts.
promptfoo | ★ 22.4k | A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities.
OpenAI Evals | ★ 18.7k | A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents.
DeepEval | ★ 16.3k | Unit-test your LLM apps with 50+ evaluation metrics
Ragas | ★ 14.4k | An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels.
Arize Phoenix | ★ 10.2k | An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production.
garak | ★ 8.2k | An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses.
Giskard | ★ 5.4k | An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner.