DeepEval

Unit-test your LLM apps with 50+ evaluation metrics

GitHub: github.com/confident-ai/deepeval (★ 16.3k) | Website: deepeval.com

Overview

DeepEval is an open-source Python framework for evaluating large-language-model systems. It works much like Pytest, but is built for unit-testing LLM apps: you write test cases, attach metrics, and assert that your model's output meets a quality threshold.

It ships with a large set of ready-to-use metrics, including G-Eval, answer relevancy, faithfulness, hallucination, and task completion. The metrics are computed locally on your machine, using LLM-as-a-judge and other NLP models. Whether you build RAG pipelines, agents, or chatbots, with LangChain or the OpenAI SDK, you can score the parts that matter for each.

It fits in the evaluation and testing layer of an AI stack. Because tests run through the deepeval CLI, you can wire them into CI/CD to catch quality regressions, compare prompts and models, and decide whether a change is safe to ship.
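
The pattern described in this overview (a test case, a metric that returns a score, and a threshold the score must meet) can be illustrated without the library. The sketch below is plain Python, not DeepEval's API: `keyword_overlap` and `check` are hypothetical names, and the word-overlap score is only a toy stand-in for the LLM-judged metrics DeepEval provides.

```python
def keyword_overlap(actual, expected):
    """Toy metric: fraction of expected words that also appear in the actual output."""
    expected_words = set(expected.lower().split())
    actual_words = set(actual.lower().split())
    if not expected_words:
        return 0.0
    return len(expected_words & actual_words) / len(expected_words)


def check(actual, expected, threshold=0.5):
    """Fail with AssertionError when the score falls below the threshold."""
    score = keyword_overlap(actual, expected)
    assert score >= threshold, f"score {score:.2f} is below threshold {threshold}"
    return score
```

A failing check raises AssertionError, which is what lets a test runner mark the case as failed and, in CI, fail the build.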

What it does

50+ ready-to-use metrics with explanations, including G-Eval, DAG, answer relevancy, faithfulness, hallucination, and task completion
Pytest-style API: write test cases with LLMTestCase and assert them against metrics
Runs metrics locally using LLM-as-a-judge and NLP models, with any LLM of your choice as the judge
Dedicated metric groups for RAG, agentic, and multi-turn (chatbot) use cases
CI/CD friendly via the deepeval test run CLI command
Works with apps built on LangChain or OpenAI, so you can compare prompts, models, and architectures

Getting started

Install DeepEval, set your judge model's API key, write a test case, and run it with the deepeval CLI.

Install DeepEval

Install the package from PyPI.

bash

pip install -U deepeval

Set your judge model key

Many metrics use an LLM as a judge. Set the OpenAI API key (or configure another model) before running.

bash

export OPENAI_API_KEY="..."
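
A missing key usually surfaces only once a metric tries to call the judge model, so it can help to check for it up front. The following is a minimal sketch using only the standard library; `judge_key_is_set` is a hypothetical helper, not part of DeepEval, and it assumes the OpenAI judge that the step above configures.

```python
import os


def judge_key_is_set(env=None):
    """Return True when OPENAI_API_KEY is present and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())
```

Calling `judge_key_is_set()` with no argument reads the real environment; passing a dict makes the check easy to test.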

Write a test case

Create a file (for example test_chatbot.py) with a metric and an LLMTestCase, then assert it.

python

import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, SingleTurnParams

def test_case():
    # G-Eval scores the output against free-form criteria using an LLM judge.
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.EXPECTED_OUTPUT],
        threshold=0.5  # minimum score required for the test to pass
    )
    # One interaction with the app: the input, what it answered, and the reference answer.
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    # Fails the test if the metric's score falls below its threshold.
    assert_test(test_case, [correctness_metric])

Run the tests

Use the deepeval CLI to run your test file.

bash

deepeval test run test_chatbot.py
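
In a CI job, the same command can be wrapped in a small script that passes the CLI's exit code back to the pipeline. This is a sketch under two assumptions: that the `deepeval` executable is on the PATH, and that it exits with a nonzero code when a test fails, as Pytest-based runners normally do (worth confirming against the official docs). `build_command` and `run_eval` are hypothetical helper names.

```python
import subprocess


def build_command(test_file):
    """Return the argument list for the CLI call shown above."""
    return ["deepeval", "test", "run", test_file]


def run_eval(test_file):
    """Run the test file and return the CLI's exit code (assumed nonzero on failure)."""
    completed = subprocess.run(build_command(test_file), check=False)
    return completed.returncode
```

A CI step could then call `sys.exit(run_eval("test_chatbot.py"))` so that the job fails whenever the evaluation does.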

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Add quality checks to CI/CD so prompt or model changes that lower output quality fail the build
Evaluate a RAG pipeline for answer relevancy, faithfulness, and contextual recall
Score agent behavior such as task completion and tool correctness
Compare prompts and models (for example moving from OpenAI to Claude) before shipping a change
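
For the comparison use case, one simple approach is to collect per-test-case scores for each variant and compare pass rate and mean score. The sketch below is plain Python and independent of DeepEval; `summarize` and `pick_variant` are hypothetical names, the scores are expected to come from whichever metric you ran, and ranking by pass rate first is one possible choice rather than a recommendation from the project.

```python
from statistics import mean


def summarize(scores, threshold=0.5):
    """Return the mean score and the share of scores at or above the threshold."""
    passed = sum(score >= threshold for score in scores)
    return {"mean": mean(scores), "pass_rate": passed / len(scores)}


def pick_variant(results, threshold=0.5):
    """Rank variants by pass rate, then by mean score, and return the best one."""
    summaries = {name: summarize(scores, threshold) for name, scores in results.items()}
    best = max(summaries, key=lambda name: (summaries[name]["pass_rate"], summaries[name]["mean"]))
    return best, summaries
```

Here `results` maps a variant name (for example a prompt version or model name) to its list of scores, and the returned summaries make it easy to see whether a change lowers quality before shipping it.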

How DeepEval compares

DeepEval alongside the other open-source evaluation and red-teaming tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Strix	★ 26.1k	Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts.
promptfoo	★ 22.4k	A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities.
OpenAI Evals	★ 18.7k	A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents.
DeepEval	★ 16.3k	Unit-test your LLM apps with 50+ evaluation metrics
Ragas	★ 14.4k	An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels.
Arize Phoenix	★ 10.2k	An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production.
garak	★ 8.2k	An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses.
Giskard	★ 5.4k	An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner.
