Arize Phoenix

Open-source observability and evaluation for LLM and agent apps

github.com/Arize-ai/phoenix | ★ 10.2k | phoenix.arize.com

Overview

Arize Phoenix is an open-source AI observability platform for experimenting with, evaluating, and troubleshooting LLM and agent applications. It traces your app's runtime with OpenTelemetry-based instrumentation, then lets you run evaluations on those traces to find where things go wrong.

It is built for developers and ML teams who need to see what their LLM application is actually doing in development and production. Phoenix is vendor- and language-agnostic, with out-of-the-box support for frameworks such as LangGraph, LlamaIndex, DSPy, CrewAI, the OpenAI Agents SDK, and the Claude Agent SDK, plus providers such as OpenAI, Anthropic, Google GenAI, and AWS Bedrock.

As an eval framework, it pairs tracing with response and retrieval evals, versioned datasets, and experiments so you can measure changes to prompts, models, and retrieval. It runs on your local machine, in a Jupyter notebook, in a container, or in the cloud.

What it does

Tracing of LLM and agent runtime using OpenTelemetry-based instrumentation
LLM-based evaluation with response and retrieval evals to benchmark performance
Versioned datasets of examples for experimentation, evaluation, and fine-tuning
Experiments to track and compare changes to prompts, LLMs, and retrieval
Prompt playground and prompt management with version control, tagging, and replay
Auto-instrumentation for popular frameworks and LLM providers via OpenInference

Getting started

Install the Phoenix package, start the local server, then point your application's instrumentation at it.

Install Phoenix

Install the full Phoenix platform from PyPI. A conda-forge build is also available.

pip install arize-phoenix

Start the Phoenix server

Launch the server from your terminal. The Phoenix UI opens at http://localhost:6006, where you can view traces and run evals.

phoenix serve
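
Before wiring up instrumentation, it can help to confirm that the server is actually listening. The following is a minimal sketch using only the Python standard library; the helper name phoenix_is_reachable is hypothetical and not part of Phoenix, and the check only confirms that something answers HTTP at the address, not that it is Phoenix.

```python
import urllib.error
import urllib.request


def phoenix_is_reachable(base_url: str = "http://localhost:6006", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url, False otherwise.

    Hypothetical helper for illustration; it is not part of the Phoenix API.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            # urlopen raises HTTPError for 4xx/5xx, so reaching here means success.
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Phoenix reachable:", phoenix_is_reachable())
```

If this prints False right after running phoenix serve, the server may still be starting, or it may be bound to a different host or port.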

Send traces from your app

Instrument your LLM or agent code with OpenInference auto-instrumentation so spans flow into Phoenix. See the docs for the integration matching your framework or provider (OpenAI, Anthropic, LangGraph, LlamaIndex, and others).
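
As a rough illustration of this step, the sketch below registers a tracer provider with a local Phoenix server and instruments the OpenAI Python client. It assumes the arize-phoenix-otel and openinference-instrumentation-openai packages are installed and that Phoenix accepts OTLP over HTTP at /v1/traces on the address shown above. The project name and the helper functions traces_endpoint and setup_tracing are hypothetical names chosen for this example; check the integration docs for your framework before relying on it.

```python
# Sketch only: assumes `arize-phoenix-otel` and
# `openinference-instrumentation-openai` are installed alongside Phoenix.

PHOENIX_BASE_URL = "http://localhost:6006"  # address shown by `phoenix serve`


def traces_endpoint(base_url: str = PHOENIX_BASE_URL) -> str:
    """Build the OTLP/HTTP traces URL for a Phoenix server (assumed path: /v1/traces)."""
    return base_url.rstrip("/") + "/v1/traces"


def setup_tracing(project_name: str = "my-llm-app"):
    """Register a tracer provider with Phoenix and instrument the OpenAI client.

    `project_name` is a hypothetical label chosen for this example.
    """
    # Imported lazily so this module still loads where the packages are absent.
    from openinference.instrumentation.openai import OpenAIInstrumentor
    from phoenix.otel import register

    tracer_provider = register(
        project_name=project_name,
        endpoint=traces_endpoint(),
    )
    OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
    return tracer_provider


# Call once at application startup, before making any OpenAI requests:
# setup_tracing()
```

After setup_tracing() has run, calls made through the OpenAI client should appear as spans under the chosen project in the Phoenix UI. Other frameworks and providers follow the same pattern with their own OpenInference instrumentor.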

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Debugging an agent that calls tools by tracing each step and inspecting where a run fails
Running response and retrieval evals on a RAG pipeline to measure answer and retrieval quality
Comparing prompt or model changes across versioned datasets using experiments before shipping
Monitoring LLM application behavior in production to catch regressions and troubleshoot issues

How Arize Phoenix compares

Arize Phoenix alongside the other open-source evaluation and red-teaming tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Strix	★ 26.1k	Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts.
promptfoo	★ 22.4k	A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities.
OpenAI Evals	★ 18.7k	A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents.
DeepEval	★ 16.3k	An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD.
Ragas	★ 14.4k	An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels.
Arize Phoenix	★ 10.2k	Open-source observability and evaluation for LLM and agent apps.
garak	★ 8.2k	An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses.
Giskard	★ 5.4k	An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner.
