Overview
Arize Phoenix is an open-source AI observability platform for experimenting with, evaluating, and troubleshooting LLM and agent applications. It traces your app's runtime with OpenTelemetry-based instrumentation, then lets you run evaluations on those traces to find where things go wrong.
It is built for developers and ML teams who need to see what their LLM application is actually doing in development and production. Phoenix is vendor- and language-agnostic, with out-of-the-box support for frameworks such as LangGraph, LlamaIndex, DSPy, CrewAI, the OpenAI Agents SDK, and the Claude Agent SDK, plus providers such as OpenAI, Anthropic, Google GenAI, and AWS Bedrock.
As an eval framework, it pairs tracing with response and retrieval evals, versioned datasets, and experiments so you can measure changes to prompts, models, and retrieval. It runs on your local machine, in a Jupyter notebook, in a container, or in the cloud.
What it does
- Tracing of LLM and agent runtime using OpenTelemetry-based instrumentation
- LLM-based evaluation with response and retrieval evals to benchmark performance
- Versioned datasets of examples for experimentation, evaluation, and fine-tuning
- Experiments to track and compare changes to prompts, LLMs, and retrieval
- Prompt playground and prompt management with version control, tagging, and replay
- Auto-instrumentation for popular frameworks and LLM providers via OpenInference
Getting started
Install the Phoenix package, start the local server, then point your application's instrumentation at it.
Install Phoenix
Install the full Phoenix platform from PyPI. A conda-forge build is also available.
pip install arize-phoenix
Start the Phoenix server
Launch the server from your terminal. The Phoenix UI opens at http://localhost:6006, where you can view traces and run evals.
phoenix serve
Send traces from your app
Instrument your LLM or agent code with OpenInference auto-instrumentation so spans flow into Phoenix. See the docs for the integration matching your framework or provider (OpenAI, Anthropic, LangGraph, LlamaIndex, and others).
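As a rough illustration of that step, the sketch below wires an OpenAI-based app to a local Phoenix server. It assumes the `arize-phoenix-otel` and `openinference-instrumentation-openai` packages are installed, that the server is listening on the default port 6006, and that it accepts OTLP over HTTP at `/v1/traces`. The project name `my-llm-app` and the helper `setup_tracing` are hypothetical names chosen for this example. Treat it as a starting point and confirm the details against the integration docs for your framework.

```python
# Assumed default: a local Phoenix server started with `phoenix serve`.
PHOENIX_ENDPOINT = "http://localhost:6006/v1/traces"


def setup_tracing(project_name: str = "my-llm-app"):
    """Register a tracer provider with Phoenix and instrument the OpenAI client.

    Returns the tracer provider, or None when the optional packages are
    not installed, so the surrounding app keeps working without tracing.
    """
    try:
        from phoenix.otel import register
        from openinference.instrumentation.openai import OpenAIInstrumentor
    except ImportError:
        return None

    # Create an OpenTelemetry tracer provider that exports spans to Phoenix.
    tracer_provider = register(project_name=project_name, endpoint=PHOENIX_ENDPOINT)

    # Patch the OpenAI SDK so each request is recorded as a span.
    OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
    return tracer_provider


if __name__ == "__main__":
    provider = setup_tracing()
    print("tracing enabled" if provider is not None else "tracing packages not installed")
```

Call `setup_tracing()` once at startup, before the app makes its first model request. Other frameworks and providers follow the same pattern, each with its own OpenInference instrumentor.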
Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Debugging an agent that calls tools by tracing each step and inspecting where a run fails
- Running response and retrieval evals on a RAG pipeline to measure answer and retrieval quality
- Comparing prompt or model changes across versioned datasets using experiments before shipping
- Monitoring LLM application behavior in production to catch regressions and troubleshoot issues
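To make the experiment use case in the list above concrete, here is a conceptual sketch in plain Python. It does not use the Phoenix API. The dataset, the two task variants, the `exact_match` evaluator, and the `run_local_experiment` helper are all hypothetical stand-ins. The point is only the shape of the workflow: run each variant over the same fixed set of examples, score every output, and compare the averages.

```python
import operator
from statistics import mean

# Hypothetical fixed dataset: each example pairs an input with an expected answer.
DATASET = [
    {"question": "2 + 2", "expected": "4"},
    {"question": "3 * 3", "expected": "9"},
    {"question": "10 - 7", "expected": "3"},
]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def task_v1(example):
    """Stand-in for a weak prompt/model variant: always answers '4'."""
    return "4"


def task_v2(example):
    """Stand-in for an improved variant: actually computes the expression."""
    left, op, right = example["question"].split()
    return str(OPS[op](int(left), int(right)))


def exact_match(output, example):
    """Evaluator: 1.0 when the output equals the expected answer, else 0.0."""
    return 1.0 if output == example["expected"] else 0.0


def run_local_experiment(task, dataset, evaluator):
    """Run the task on every example and return the mean evaluator score."""
    return mean(evaluator(task(example), example) for example in dataset)


scores = {
    name: run_local_experiment(task, DATASET, exact_match)
    for name, task in [("v1", task_v1), ("v2", task_v2)]
}
print(scores)
```

Because both variants run against the same examples, a difference in mean score reflects the change being tested and not a change in the data.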
How Arize Phoenix compares
Arize Phoenix alongside the other open-source evaluation and red-teaming tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Strix | ★ 26.1k | Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts. |
| promptfoo | ★ 22.4k | A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities. |
| OpenAI Evals | ★ 18.7k | A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents. |
| DeepEval | ★ 16.3k | An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD. |
| Ragas | ★ 14.4k | An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels. |
| Arize Phoenix | ★ 10.2k | Open-source observability and evaluation for LLM and agent apps. |
| garak | ★ 8.2k | An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses. |
| Giskard | ★ 5.4k | An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner. |