Evidently

Evaluate, test, and monitor ML and LLM systems from experiments to production

github.com/evidentlyai/evidently | ★ 7.6k | evidentlyai.com

Overview

Evidently is an open-source Python library for evaluating, testing, and monitoring machine learning and LLM systems across their whole lifecycle, from early experiments to production. It works with both tabular and text data and supports predictive tasks like classification as well as generative tasks like RAG.

It is built for data scientists and ML engineers who need to measure output quality, catch data drift, and check for regressions over time. The library ships with over 100 built-in metrics, ranging from data drift detection to LLM-as-a-judge evaluators, and lets you add custom metrics through a Python interface.

Evidently is modular. You can start with one-off Reports and Test Suites, then move up to a self-hosted Monitoring UI that tracks metrics and test results over time. Because results export as JSON, HTML, or a Python dictionary, they can plug into existing tools.

What it does

Over 100 built-in metrics covering data drift, ML quality, and LLM evaluation
Reports that compute and summarize data, ML, and LLM quality checks, viewable in Python or exportable to JSON, HTML, or a dictionary
Test Suites that add pass/fail conditions to any Report for regression testing, CI/CD checks, and data validation
Row-level descriptors such as Sentiment, TextLength, and Contains for evaluating text outputs
Self-hostable Monitoring UI to visualize metrics and test results over time
Python interface for custom metrics and support for LLM-as-a-judge evaluators
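
To make the pass/fail idea behind Test Suites concrete, here is a minimal plain-Python sketch of conditions layered on top of already-computed metric values. It is not Evidently's API: the `run_checks` helper, the metric names, and the thresholds are all hypothetical and only illustrate the concept.

```python
import operator

# Hypothetical condition names mapped to comparison functions.
CONDITIONS = {
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    "eq": operator.eq,
}


def run_checks(metrics, checks):
    """Evaluate (metric_name, condition, threshold) checks against metric values."""
    results = []
    for name, condition, threshold in checks:
        value = metrics[name]
        passed = CONDITIONS[condition](value, threshold)
        results.append({
            "metric": name,
            "value": value,
            "condition": f"{condition} {threshold}",
            "passed": passed,
        })
    return results


# Made-up metric values, standing in for the output of an evaluation run.
metrics = {"share_of_denials": 0.33, "mean_answer_length": 30.7}

# Made-up conditions: at most half of answers are denials, mean length at least 50.
checks = [
    ("share_of_denials", "lte", 0.5),
    ("mean_answer_length", "gte", 50),
]

results = run_checks(metrics, checks)
all_passed = all(r["passed"] for r in results)

for r in results:
    status = "PASS" if r["passed"] else "FAIL"
    print(f"{status}: {r['metric']} = {r['value']} (expected {r['condition']})")
```

In a CI job, a script like this could exit with a non-zero status when `all_passed` is false, so that a failed condition blocks the pipeline.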

Getting started

Install Evidently from PyPI, then run a small text-evaluation Report on a toy dataset.

Install Evidently

Install the library from PyPI. A Conda install via conda-forge is also available.

bash

pip install evidently

Build an LLM evaluation dataset

Create a small DataFrame of questions and answers, then wrap it in an Evidently Dataset with row-level descriptors that score sentiment, measure text length, and flag denial phrases in each answer.

python

import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import Contains, Sentiment, TextLength
from evidently.presets import TextEvals

# Toy evaluation data: one question/answer pair per row.
eval_df = pd.DataFrame(
    [
        ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
        ["Who painted the Mona Lisa?", "Leonardo da Vinci."],
        ["Can you write an essay?", "I'm sorry, but I can't assist with homework."],
    ],
    columns=["question", "answer"],
)

# Wrap the DataFrame and attach row-level descriptors computed on "answer".
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        # Flags answers that contain any of the listed denial words.
        Contains("answer", items=["sorry", "apologize"], mode="any", alias="Denials"),
    ],
)
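
As a rough illustration of what row-level descriptors compute, the sketch below reproduces the idea of the Length and Denials columns in plain Python. The `text_length` and `contains_any` helpers are hypothetical stand-ins, not Evidently's implementation, and the case-insensitive matching is an assumption made here for simplicity.

```python
def text_length(text):
    """Return the number of characters in the text."""
    return len(text)


def contains_any(text, items):
    """Return True if any item occurs in the text (case-insensitive by assumption)."""
    lowered = text.lower()
    return any(item.lower() in lowered for item in items)


answers = [
    "The capital of Japan is Tokyo.",
    "Leonardo da Vinci.",
    "I'm sorry, but I can't assist with homework.",
]

# One output row per input row: each descriptor adds a column of per-row values.
rows = [
    {
        "answer": answer,
        "Length": text_length(answer),
        "Denials": contains_any(answer, ["sorry", "apologize"]),
    }
    for answer in answers
]

for row in rows:
    print(row["Length"], row["Denials"], row["answer"])
```

Only the third answer is flagged as a denial, because it is the only one containing "sorry" or "apologize".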

Run a Report

Run a TextEvals Report to see the distribution of scores. You can view it in a notebook or export it as JSON or a dictionary.

python

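report = Report([
    TextEvals()
])

# Compute the Report on the dataset prepared above.
my_eval = report.run(eval_dataset)

# In a notebook, a bare expression on the last line renders the Report inline.
my_eval

# Alternatively, export the results:
# my_eval.json()
# my_eval.dict()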

Launch the monitoring dashboard (optional)

If you have uv installed, you can start the self-hosted Evidently UI with demo projects in a single command.

bash

uv run --with evidently evidently ui --demo-projects all

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Run LLM output evaluations such as sentiment, length, and denial checks during prompt and RAG experiments
Detect data drift by comparing current production data against a reference dataset
Add pass/fail Test Suites to CI/CD pipelines for ML regression testing and data validation
Self-host a monitoring dashboard to track model and LLM quality metrics over time
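
The drift use case above compares a current sample against a reference sample. One common way to quantify such a shift is the Population Stability Index (PSI), sketched below in plain Python for a categorical column. This is a generic calculation, not Evidently's implementation; the `psi` helper, the toy data, and the 0.2 threshold are assumptions chosen for illustration.

```python
import math
from collections import Counter


def psi(reference, current, eps=1e-6):
    """Population Stability Index between two samples of a categorical column."""
    categories = set(reference) | set(current)
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    score = 0.0
    for category in categories:
        # Clamp shares at eps so categories missing from one sample do not cause log(0).
        p = max(ref_counts[category] / len(reference), eps)
        q = max(cur_counts[category] / len(current), eps)
        score += (q - p) * math.log(q / p)
    return score


# Toy data: the reference is balanced, the shifted sample is dominated by "a".
reference = ["a"] * 50 + ["b"] * 50
current_same = ["a"] * 50 + ["b"] * 50
current_shifted = ["a"] * 90 + ["b"] * 10

psi_same = psi(reference, current_same)
psi_shifted = psi(reference, current_shifted)

# Assumed rule-of-thumb threshold; tune it for your own data.
DRIFT_THRESHOLD = 0.2
drift_detected = psi_shifted > DRIFT_THRESHOLD

print(f"PSI (same distribution): {psi_same:.3f}")
print(f"PSI (shifted distribution): {psi_shifted:.3f}, drift detected: {drift_detected}")
```

Identical distributions give a PSI of zero, and the score grows as the category shares diverge, which is what makes it usable as a drift signal with a threshold.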

How Evidently compares

Evidently alongside the other open-source observability and LLMOps tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Langfuse	★ 29.4k	A self-hostable platform for tracing LLM and agent calls, managing prompts, and running evaluations to debug and improve AI applications.
Opik	★ 19.7k	An open-source platform from Comet for tracing, evaluating, and monitoring LLM applications, RAG systems, and agent workflows with dashboards and LLM-as-judge metrics.
TensorZero	★ 11.7k	An open-source LLMOps platform that puts a single gateway in front of every major LLM provider and adds observability, evaluation, optimization, and A/B testing.
Evidently	★ 7.6k	An open-source Python library for evaluating, testing, and monitoring ML and LLM systems from experiments to production.
OpenLLMetry	★ 7.2k	An OpenTelemetry-based SDK that auto-instruments LLM providers, vector databases, and frameworks so traces flow into any existing observability backend.
Helicone	★ 5.8k	A proxy-based observability platform that logs, monitors, and evaluates LLM API calls by routing requests through its endpoint with one line of code.
AgentOps	★ 5.6k	An SDK for monitoring AI agents that tracks LLM cost, session replays, and performance across frameworks like CrewAI, LangChain, and the OpenAI Agents SDK.
Pydantic Logfire	★ 4.3k	An observability platform from the Pydantic team that records LLM calls, agent runs, and tool invocations with tokens, cost, and latency attached.
