Overview
Evidently is an open-source Python library for evaluating, testing, and monitoring machine learning and LLM systems across their whole lifecycle, from early experiments to production. It works with both tabular and text data and supports predictive tasks like classification as well as generative tasks like RAG.
It is built for data scientists and ML engineers who need to measure output quality, catch data drift, and check for regressions over time. The library ships with over 100 built-in metrics, ranging from data drift detection to LLM-as-a-judge evaluators, and lets you add custom metrics through a Python interface.
As an LLM observability and monitoring tool, Evidently is modular. You can start with one-off Reports and Test Suites, then move up to a self-hosted Monitoring UI that tracks metrics and test results over time. Its open architecture lets you export results as JSON, HTML, or a Python dictionary and plug into existing tools.
What it does
- Over 100 built-in metrics covering data drift, ML quality, and LLM evaluation
- Reports that compute and summarize data, ML, and LLM quality checks, viewable in Python or exportable to JSON, HTML, or a dictionary
- Test Suites that add pass/fail conditions to any Report for regression testing, CI/CD checks, and data validation
- Row-level descriptors such as Sentiment, TextLength, and Contains for evaluating text outputs
- Self-hostable Monitoring UI to visualize metrics and test results over time
- Python interface for custom metrics and support for LLM-as-a-judge evaluators
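To make the idea of a row-level descriptor concrete, the sketch below computes two such scores per row in plain Python. It does not use Evidently and is not how the library implements its descriptors. The helper names `text_length` and `contains_any` are hypothetical, and the case-insensitive matching is an assumption made for illustration.

```python
# Plain-Python illustration of row-level descriptors; this is NOT Evidently's code.
# text_length and contains_any are hypothetical helper names.

def text_length(text: str) -> int:
    """Number of characters in the text."""
    return len(text)

def contains_any(text: str, items: list[str]) -> bool:
    """True if any item occurs in the text (case-insensitive matching is an assumption)."""
    lowered = text.lower()
    return any(item.lower() in lowered for item in items)

answers = [
    "The capital of Japan is Tokyo.",
    "Leonardo da Vinci.",
    "I'm sorry, but I can't assist with homework.",
]

# One score per input row is what makes a descriptor "row-level".
rows = [
    {
        "answer": a,
        "Length": text_length(a),
        "Denials": contains_any(a, ["sorry", "apologize"]),
    }
    for a in answers
]

for row in rows:
    print(row["Length"], row["Denials"])
```

Each row ends up with its own `Length` and `Denials` value, which a report can then summarize as a distribution across the dataset.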
Getting started
Install Evidently from PyPI, then run a small text-evaluation Report on a toy dataset.
Install Evidently
Install the library from PyPI. A Conda install via conda-forge is also available.
pip install evidently
Build an LLM evaluation dataset
Create a small DataFrame of questions and answers, then wrap it in an Evidently Dataset with row-level descriptors that score sentiment, text length, and denial phrases.
import pandas as pd
from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
from evidently.presets import TextEvals
eval_df = pd.DataFrame([
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci."],
    ["Can you write an essay?", "I'm sorry, but I can't assist with homework."]],
    columns=["question", "answer"])
eval_dataset = Dataset.from_pandas(pd.DataFrame(eval_df),
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        Contains("answer", items=['sorry', 'apologize'], mode="any", alias="Denials")
    ])
Run a Report
Run a TextEvals Report to see the distribution of scores. You can view it in a notebook or export it as JSON or a dictionary.
report = Report([
    TextEvals()
])
my_eval = report.run(eval_dataset)
my_eval  # in a notebook, this renders the Report inline
# my_eval.json()  # export as JSON
# my_eval.dict()  # export as a Python dictionary
Launch the monitoring dashboard (optional)
If you have uv installed, you can start the self-hosted Evidently UI with demo projects in a single command.
uv run --with evidently evidently ui --demo-projects all
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Run LLM output evaluations such as sentiment, length, and denial checks during prompt and RAG experiments
- Detect data drift by comparing current production data against a reference dataset
- Add pass/fail Test Suites to CI/CD pipelines for ML regression testing and data validation
- Self-host a monitoring dashboard to track model and LLM quality metrics over time
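The drift and pass/fail use cases above can be sketched together in plain Python. The example below compares a current sample against a reference sample using the Population Stability Index, one common drift statistic, and turns the score into a pass/fail condition of the kind a CI job could gate on. It is a generic illustration, not Evidently's implementation or its default drift method. The names `psi` and `check_drift`, the equal-width binning, and the 0.2 threshold are all assumptions made for this sketch.

```python
import math

# Generic illustration only; NOT Evidently's implementation or default drift method.
# psi, check_drift, the binning scheme, and the 0.2 threshold are assumptions.

def psi(reference: list[float], current: list[float], bins: int = 5) -> float:
    """Population Stability Index between two samples, using equal-width
    bins derived from the range of the reference sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all reference values are equal

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps empty bins from causing log(0) or division by zero.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

def check_drift(reference: list[float], current: list[float], threshold: float = 0.2) -> bool:
    """Pass/fail condition: True means no drift was flagged."""
    return psi(reference, current) < threshold

reference = [float(x % 10) for x in range(200)]  # evenly spread over 0..9
same = list(reference)                           # identical distribution
shifted = [x + 6.0 for x in reference]           # mass moves toward the top bins

print(check_drift(reference, same))     # True: distributions match
print(check_drift(reference, shifted))  # False: drift flagged
```

In a CI pipeline, a script like this would exit with a non-zero status when the check returns `False`, so the build fails when current data diverges from the reference.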
How Evidently compares
Evidently alongside the other open-source observability and LLMOps tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Langfuse | ★ 29.4k | A self-hostable platform for tracing LLM and agent calls, managing prompts, and running evaluations to debug and improve AI applications. |
| Opik | ★ 19.7k | An open-source platform from Comet for tracing, evaluating, and monitoring LLM applications, RAG systems, and agent workflows with dashboards and LLM-as-judge metrics. |
| TensorZero | ★ 11.7k | An open-source LLMOps platform that puts a single gateway in front of every major LLM provider and adds observability, evaluation, optimization, and A/B testing. |
| Evidently | ★ 7.6k | An open-source Python library for evaluating, testing, and monitoring ML and LLM systems from experiments to production. |
| OpenLLMetry | ★ 7.2k | An OpenTelemetry-based SDK that auto-instruments LLM providers, vector databases, and frameworks so traces flow into any existing observability backend. |
| Helicone | ★ 5.8k | A proxy-based observability platform that logs, monitors, and evaluates LLM API calls by routing requests through its endpoint with one line of code. |
| AgentOps | ★ 5.6k | An SDK for monitoring AI agents that tracks LLM cost, session replays, and performance across frameworks like CrewAI, LangChain, and the OpenAI Agents SDK. |
| Pydantic Logfire | ★ 4.3k | An observability platform from the Pydantic team that records LLM calls, agent runs, and tool invocations with tokens, cost, and latency attached. |