AI/TLDR

What Is Arize Phoenix? Local LLM Observability

After reading, you'll understand what Arize Phoenix is, how it lets you trace and evaluate LLM apps locally, and where it fits next to hosted observability tools.

INTERMEDIATE · 10 MIN READ · UPDATED 2026-06-14

In plain English

When you build an app on top of a large language model, a lot happens that you never see. A user asks a question, your code retrieves some documents, stuffs them into a prompt, calls the model, maybe calls a tool, then calls the model again. When the final answer is wrong, where did it go wrong? Without a record of every step, you are debugging blind.


Arize Phoenix is an open-source tool that records all of those hidden steps and shows them to you in a UI you can open in your browser. It does two things: observability (it traces every call your app makes, so you can replay exactly what happened) and evaluation (it scores those traces for quality — was the retrieved context relevant? did the model hallucinate?). The headline trait that sets it apart: it runs locally, with no account and no signup. You install it, start it, and a dashboard pops up on your own machine.

Think of a black-box flight recorder for your AI app. A plane's recorder captures every instrument reading so that after a problem, investigators can replay the whole flight second by second instead of guessing. Phoenix captures every prompt, retrieval, tool call, token count, and latency for each request your app handles. When something looks off, you scrub back through the recording and see the exact step that went wrong — and unlike a cloud dashboard, the recorder sits on your desk, not in someone else's data center.

Why it matters

LLM apps fail in ways normal software does not. A web server either returns the right row from a database or it doesn't, and the bug is reproducible. An LLM can return a fluent, confident, wrong answer, and the same input run twice can produce two different results. You cannot fix what you cannot see, and most of the interesting behavior is buried inside prompts and model calls your logs never captured.

Phoenix exists to make that behavior visible and measurable. It solves a few specific pains builders hit:

  • Debugging multi-step pipelines. In a RAG or agent app, a wrong answer can come from bad retrieval, a bad prompt, a failed tool call, or the model itself. A trace shows each step so you can pinpoint the real culprit instead of guessing.
  • Catching bad retrieval. The most common RAG failure is that the right document never made it into the prompt. Phoenix can score whether retrieved chunks were actually relevant to the question, turning a vague 'answers feel off' into a concrete number.
  • Measuring quality, not vibes. Eyeballing a few outputs is not testing. Phoenix runs evaluators across many traces so you can say how often the app hallucinates or retrieves irrelevant context.
  • Watching cost and latency. Each span records token counts and timing, so you can see which step is slow or expensive before it surprises you in a bill.

The reason engineers reach for Phoenix specifically is its zero-setup, local-first design. Many observability tools are cloud-first: you sign up, get an API key, and your traces (which often contain real user prompts and private documents) leave your machine. Phoenix can run as a single local process with no account at all, so an engineer can drop it into a notebook during development and inspect traces in seconds, and a privacy-sensitive team can keep every trace inside its own infrastructure. It is a different default from hosted-first platforms like the ones compared in Langfuse vs LangSmith vs Helicone.

How it works

Phoenix is built on OpenTelemetry, the open industry standard for tracing. This matters more than it sounds: instead of inventing its own logging format, Phoenix speaks a protocol that dozens of libraries already emit. You add a small instrumentation package, your LLM framework starts sending traces, and Phoenix collects them. There is no vendor SDK woven through your business logic.

Spans and traces: the unit of recording

A span is a record of one operation — one model call, one retrieval, one tool invocation — with its inputs, outputs, timing, token counts, and any error. A trace is the full tree of spans for handling one request, from the top-level call down through every nested step. Because the spans nest, you see not just what happened but the order and hierarchy: which retrieval fed which prompt, which tool call the agent chose, and where time was spent.
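To make the span/trace relationship concrete, here is a minimal sketch in plain Python. The `Span` class, its field names, and the latency numbers are illustrative assumptions, not Phoenix's actual schema; the point is only that a flat list of spans with parent links can be rebuilt into the tree a trace view displays.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One recorded operation. Field names are illustrative, not Phoenix's schema."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    latency_ms: float
    children: list["Span"] = field(default_factory=list)


def build_trace(spans: list[Span]) -> Span:
    """Link a flat list of spans into a tree and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> list[str]:
    """Return one indented line per span, parents before children."""
    lines = [f"{'  ' * depth}{span.name} ({span.latency_ms:.0f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical spans for a single RAG request.
flat = [
    Span("1", None, "rag_request", 1840),
    Span("2", "1", "retriever", 120),
    Span("3", "2", "embedding", 45),
    Span("4", "1", "llm_call", 1650),
]
trace = build_trace(flat)
print("\n".join(render(trace)))
```

The indentation in the printed output mirrors the nesting: the embedding span sits under the retriever that triggered it, and both sit under the root request.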

Instrumentation: where the spans come from

You rarely write spans by hand. Phoenix ships auto-instrumentation for common frameworks — LangChain, LlamaIndex, and direct provider SDKs among them. You register the instrumentation once at startup, and from then on every model call and retrieval in those libraries automatically produces a span. Because it is OpenTelemetry under the hood, anything that can emit OTel traces can feed Phoenix.

Wiring Phoenix into an app (Python)

```python
import phoenix as px
from phoenix.otel import register

# 1) Start the local Phoenix app: opens a UI on your own machine,
#    no account, no API key, and no trace data leaves localhost.
px.launch_app()

# 2) Register OpenTelemetry tracing and auto-instrument your stack.
#    From here on, model + retrieval calls emit spans automatically.
tracer_provider = register(auto_instrument=True)

# 3) Just run your app as usual. Each request becomes a trace you
#    can open, expand span by span, and evaluate in the Phoenix UI.
#    run_my_rag_pipeline is a placeholder for your own entry point.
run_my_rag_pipeline("What's our refund window for physical goods?")
```
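Conceptually, auto-instrumentation works by wrapping library functions so that each call records a span without changes to your own code. The sketch below illustrates that idea with the standard library only. `FakeLLMClient`, `traced`, and `RECORDED_SPANS` are hypothetical names invented for this example, and this is not how Phoenix or its instrumentation packages are actually implemented.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for an exporter; a real setup sends spans to a collector


def traced(name):
    """Wrap a function so every call records a span-like dict."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            span = {"name": name, "input": {"args": args, "kwargs": kwargs}}
            try:
                result = fn(*args, **kwargs)
                span["output"] = result
                return result
            except Exception as exc:
                span["error"] = repr(exc)
                raise
            finally:
                # Runs on success and failure, so errors are recorded too.
                span["latency_ms"] = (time.perf_counter() - start) * 1000
                RECORDED_SPANS.append(span)
        return wrapper
    return decorator


class FakeLLMClient:
    """Hypothetical client standing in for a provider SDK."""

    def complete(self, prompt):
        return f"echo: {prompt}"


# "Instrument" the library once at startup by patching its method.
FakeLLMClient.complete = traced("llm_call")(FakeLLMClient.complete)

# Application code is unchanged, yet the call is now recorded.
answer = FakeLLMClient().complete("What is our refund window?")
print(RECORDED_SPANS[0]["name"], "->", RECORDED_SPANS[0]["output"])
```

The design point is that the patch happens once, at startup, which is why the business logic never has to mention tracing.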

Evaluation: scoring the traces

Recording is half the job; the other half is judging quality. Phoenix includes evaluators that read your captured spans and score them. Many are LLM-as-a-judge evaluators — a model grades each trace against a rubric. Typical built-in checks ask whether the retrieved context was relevant to the question, whether the answer was grounded in that context (a hallucination check), and whether the response was correct given a reference. You can run these over a whole batch of traces or a curated dataset of examples, and the scores show up next to each trace in the UI so weak spots stand out.
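The shape of an LLM-as-a-judge evaluator can be sketched as a rubric prompt, a judge call, and a parser for the verdict. Everything here is an assumption for illustration: the rubric wording, the `evaluate` helper, and especially `stub_judge`, which replaces a real model call with a crude substring check so the example runs offline. It is not Phoenix's evaluator API.

```python
RUBRIC = (
    "You are grading an answer for groundedness.\n"
    "Context: {context}\n"
    "Answer: {answer}\n"
    "Reply with exactly one word: grounded or hallucinated."
)


def stub_judge(prompt: str) -> str:
    """Placeholder for a real model call. It says 'grounded' only when the
    answer appears verbatim in the context, far cruder than a real judge."""
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    answer = fields["Answer"].lower()
    context = fields["Context"].lower()
    return "grounded" if answer in context else "hallucinated"


def evaluate(traces, judge):
    """Label each trace and return (labels, hallucination rate)."""
    labels = []
    for t in traces:
        raw = judge(RUBRIC.format(context=t["context"], answer=t["answer"]))
        label = raw.strip().lower()
        if label not in {"grounded", "hallucinated"}:
            label = "unparseable"  # never silently coerce a malformed verdict
        labels.append(label)
    rate = labels.count("hallucinated") / len(labels)
    return labels, rate


# Two made-up traces: one supported by the context, one not.
traces = [
    {"context": "Refunds are accepted within 30 days.", "answer": "within 30 days"},
    {"context": "Refunds are accepted within 30 days.", "answer": "within 90 days"},
]
labels, rate = evaluate(traces, stub_judge)
print(labels, rate)
```

Swapping `stub_judge` for a function that calls an actual model keeps the rest of the loop unchanged, which is the main reason to keep the judge behind a plain callable.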

What a Phoenix trace actually captures

It helps to see concretely what lands in a single trace for a RAG request. Each row below is one span in the tree; together they let you replay the whole request and spot the broken step.

| Span | What it records | What you catch with it |
| --- | --- | --- |
| Retriever | The query, the chunks returned, similarity scores | Right document missing from context |
| LLM call | Full prompt, response, token counts, latency | Bad prompt, slow or expensive calls |
| Tool call | Tool name, arguments, return value, errors | Agent picking the wrong tool or bad args |
| Embedding | Input text, vector, model used | Wrong or stale embedding model |
| Root span | End-to-end input, final output, total time | Overall pass/fail and total cost |
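Because each span carries timing and token counts, finding the slow or expensive step is simple aggregation. The sketch below assumes a flattened list of spans with made-up numbers and an assumed flat token price; real span attributes and real pricing will differ.

```python
from collections import defaultdict

# Flattened spans from one hypothetical RAG trace. All numbers are made up.
spans = [
    {"kind": "retriever", "latency_ms": 120, "tokens": 0},
    {"kind": "embedding", "latency_ms": 45, "tokens": 12},
    {"kind": "llm", "latency_ms": 900, "tokens": 1500},
    {"kind": "tool", "latency_ms": 300, "tokens": 0},
    {"kind": "llm", "latency_ms": 750, "tokens": 800},
]

PRICE_PER_1K_TOKENS = 0.002  # assumed flat rate; real pricing varies by model


def summarize(spans):
    """Sum latency and tokens per span kind."""
    totals = defaultdict(lambda: {"latency_ms": 0, "tokens": 0})
    for s in spans:
        totals[s["kind"]]["latency_ms"] += s["latency_ms"]
        totals[s["kind"]]["tokens"] += s["tokens"]
    return dict(totals)


summary = summarize(spans)
slowest = max(summary, key=lambda kind: summary[kind]["latency_ms"])
total_tokens = sum(v["tokens"] for v in summary.values())
est_cost = total_tokens / 1000 * PRICE_PER_1K_TOKENS

print(f"slowest step: {slowest}")
print(f"total tokens: {total_tokens}, estimated cost: ${est_cost:.6f}")
```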

Phoenix vs hosted-first observability tools

Phoenix sits in the same family as other LLM observability tools, but its default posture is different. The contrast is less about features — most cover tracing and evals — and more about where it runs and how you start.

Neither posture is strictly better — they suit different moments. Local-first shines during development and for privacy-sensitive teams who cannot send prompts off-box. Hosted-first shines when you want a managed, always-on dashboard the whole team shares without running infrastructure. And the line is not absolute: Phoenix can also be self-hosted as a persistent server for a team, and it shares an evaluation lineage with the broader Arize platform when you outgrow a single laptop. The honest framing: choose by your constraints — privacy, who needs access, and how much ops you want to own — not by a feature checkbox.

Common pitfalls

Phoenix is easy to start and easy to misuse. Most disappointments trace back to expectations rather than the tool.

  • Treating the local app as permanent storage. A quick launch_app() session is great for development, but an ephemeral local instance is not a durable system of record. For a team that needs traces to persist, run Phoenix as a self-hosted server with a backing database rather than relying on a notebook process.
  • Forgetting evals are themselves models. LLM-as-a-judge evaluators are probabilistic. A 'hallucination' score is a model's opinion, not ground truth, and it carries the usual judge biases. Spot-check the judge against human labels before you trust its numbers.
  • Tracing everything at full volume. In production, recording every span of every request gets expensive and noisy. Decide what to keep — see trace sampling — rather than capturing 100% by reflex.
  • Logging sensitive data without thinking. Spans capture full prompts and outputs, which may contain user PII. Running locally helps, but if you self-host for a team, treat the trace store like any other store of sensitive data.
  • Confusing Phoenix with the Arize cloud product. They are related but not the same. Phoenix is the open-source developer tool; the commercial Arize platform is the large-scale production offering. Pick docs and features for the one you are actually using.
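For the sampling pitfall above, one common approach is to decide per trace rather than per span, so a kept trace is always complete. The sketch below hashes the trace ID into a bucket and keeps a fixed fraction. The `keep_trace` helper and the 10% rate are assumptions for illustration; in a real OpenTelemetry setup you would normally configure a sampler in the SDK instead of writing your own.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide once per trace, so all spans of a kept trace stay together."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


# Hypothetical trace IDs; keep roughly 10% of them.
ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in ids if keep_trace(t, 0.10)]
print(f"kept {len(kept)} of {len(ids)}")
```

Hashing instead of calling a random number generator makes the decision deterministic: every service that sees the same trace ID reaches the same verdict.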

Going deeper

Once tracing and basic evals are working, a few directions are worth knowing.

Datasets and experiments. Beyond watching live traces, you can collect a fixed set of inputs (a dataset) and run your pipeline over it repeatedly, scoring each run with the same evaluators. That turns Phoenix into a regression-testing harness: change a prompt or swap a model, re-run the dataset, and compare scores side by side instead of trusting a hunch.
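The regression-testing loop described above can be sketched in a few lines. The dataset rows, the two stub pipelines, and the substring evaluator are all hypothetical stand-ins chosen so the example runs offline; they are not Phoenix's datasets or experiments API.

```python
DATASET = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Do you ship internationally?", "expected": "yes"},
    {"question": "What is the warranty period?", "expected": "1 year"},
]

# Two stub "pipelines" standing in for the app before and after a prompt change.
V1_ANSWERS = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Do you ship internationally?": "We only ship domestically.",
    "What is the warranty period?": "The warranty lasts 1 year.",
}
V2_ANSWERS = {**V1_ANSWERS, "Do you ship internationally?": "Yes, we ship to most countries."}


def pipeline_v1(question: str) -> str:
    return V1_ANSWERS[question]


def pipeline_v2(question: str) -> str:
    return V2_ANSWERS[question]


def contains_expected(output: str, expected: str) -> float:
    """Crude evaluator: 1.0 if the expected text appears in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def run_experiment(pipeline, dataset, evaluator) -> float:
    """Run the pipeline over every row and return the mean score."""
    scores = [evaluator(pipeline(row["question"]), row["expected"]) for row in dataset]
    return sum(scores) / len(scores)


score_v1 = run_experiment(pipeline_v1, DATASET, contains_expected)
score_v2 = run_experiment(pipeline_v2, DATASET, contains_expected)
print(f"v1: {score_v1:.2f}  v2: {score_v2:.2f}")
```

Holding the dataset and evaluator fixed while changing only the pipeline is what makes the two scores comparable.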

Online vs offline use. The same trace data supports two modes. Offline, you analyze a captured batch during development. Online, you keep Phoenix running against a live app and watch quality and latency over time, optionally folding in real user feedback as an extra signal. The evaluators and spans are the same; only the cadence differs.

The OpenTelemetry payoff. Because Phoenix is OTel-native, your instrumentation is not locked to it. The same traces your app emits can flow to other OTel-compatible backends, and you can correlate LLM spans with the rest of your system's tracing. This is the practical reason to prefer a standards-based tool: the recording outlives any single dashboard. As your needs grow toward cost attribution and per-service error budgets, that portability is what keeps you from re-instrumenting everything later.

The durable lesson is the same one that holds across observability: you improve what you can measure. Phoenix's contribution is to make LLM behavior measurable with the lowest possible setup cost — open it locally, see every step, score it, and let the data, not vibes, drive your next change.

FAQ

What is Arize Phoenix used for?

Phoenix is an open-source tool for LLM observability and evaluation. It traces every step your AI app takes — retrievals, prompts, model calls, tool calls — so you can replay and debug them, and it runs evaluators that score those traces for quality, like whether retrieval was relevant or the answer was grounded. It is especially popular for debugging RAG and agent pipelines.

Is Arize Phoenix free and open source?

Yes. Phoenix is open source and free, and it can run entirely on your own machine with no account or signup. It is built and maintained by Arize AI, which also sells a separate commercial observability platform aimed at large-scale production — but Phoenix itself is the free, self-hostable, developer-facing tool.

What is the difference between Phoenix and Langfuse?

Both are open-source LLM observability tools with tracing and evals, and both are OpenTelemetry-friendly. The main difference is default posture: Phoenix is local-first and famously runs with no account, which suits development and privacy-sensitive work, while Langfuse is typically run as a hosted or self-hosted server centered on a shared web app. Choose by your privacy needs, who needs dashboard access, and how much infrastructure you want to run.

Does Arize Phoenix require sending my data to the cloud?

No. Phoenix can run as a local process so that your traces — including the prompts and documents inside them — never leave your machine. That offline-capable default is its signature trait. You can also self-host it as a persistent server for a team if you want durable, shared storage.

How does Phoenix evaluate LLM responses?

Phoenix reads your captured traces and scores them with evaluators. Many are LLM-as-a-judge evaluators, where a model grades each trace against a rubric — for example, checking whether retrieved context was relevant or whether the answer was supported by that context (a hallucination check). You can run these across a batch of traces or a fixed dataset and see the scores next to each trace in the UI.

Does Phoenix work with LangChain and LlamaIndex?

Yes. Phoenix ships auto-instrumentation for common frameworks including LangChain and LlamaIndex, plus direct provider SDKs. You register the instrumentation once at startup and their model and retrieval calls automatically emit spans. Because Phoenix is OpenTelemetry-native, anything that can emit OTel traces can feed it.
