AI/TLDR

What Is LLM Observability? Logs, Traces, and Tokens

Learn what to record on every LLM call and why ordinary application monitoring misses what matters in AI apps.

Beginner · 11 min read · Updated 2026-06-11

In plain English

LLM observability is the practice of recording enough detail about every call your app makes to a language model that you can later reconstruct exactly what happened — the prompt that went in, the answer that came back, how many tokens it cost, how long it took, and what the rest of your code did with the result.

Think of a flight recorder — the black box on a plane. Nobody watches it during a smooth flight. But the moment something goes wrong, investigators replay every reading second by second to understand the failure. LLM observability is your app's flight recorder: while everything is fine you ignore it, and when a user says "the chatbot gave me garbage," you open the trace for that exact request and see the whole story.

The word observability comes from control theory — a system is observable if you can figure out its internal state purely from its outputs. For software it has a more practical meaning: can you answer a brand-new question about your running system without shipping new code? With good observability, when someone asks "why did this answer cost 40 cents?" you already have the data on hand. Without it, you're guessing.

Why it matters

Traditional monitoring tools — the APM (application performance monitoring) world of Datadog, New Relic, and Sentry — were built for deterministic software. The same input gives the same output, errors throw exceptions, and a 200 response code means success. LLM apps break all three of those assumptions, which is why ordinary monitoring quietly misses the failures that actually hurt you.

  • Same input, different output. Language models sample their answers, so the same prompt can produce a great reply on Monday and a hallucinated one on Tuesday. A bug you can't reproduce on demand is invisible to test-once monitoring. See LLM temperature explained for why this happens.
  • A 200 OK can still be wrong. The model returns HTTP 200 and a perfectly-formatted paragraph that is completely false. Your status-code dashboard stays green while users get bad answers. The failure is in the content, not the transport.
  • Cost is invisible by default. Every call burns tokens, and tokens are money. A single runaway agent loop or a bloated context window can quietly 10x your bill. APM tools track CPU and memory, not tokens.
  • Failures hide deep in a chain. A RAG pipeline or agent makes many model and tool calls per user request. The final answer looks fine but the retrieval step pulled the wrong document three steps back. You need to see the whole chain, not just the last call.

So who should care? Anyone running an LLM feature for real users. Observability is a core pillar of LLMOps — the discipline of operating model-powered apps. It's what turns "a user complained" into "here is the exact trace, the prompt was missing the system instructions, fixed." Without it you debug by re-running prompts in a playground and hoping the bug reappears.

How it works

Mechanically, observability means wrapping every model call so that a record is captured around it — before the request, after the response, and around any error. That record is sent to a backend (a database or a hosted platform) where you can search, chart, and replay it. The unit of capture is a span: a single timed operation with attributes attached. One LLM call is one span; a whole user request is a trace made of many nested spans.

For a multi-step app, those spans nest into a tree. The outer span is the user's request; inside it sit a retrieval span, one or more model spans, and tool-call spans. This tree is the trace, and being able to expand it is the whole point — you see exactly which step was slow, which one was expensive, and where a wrong answer was born.
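To make the tree concrete, here is a minimal sketch of nested spans using only the standard library. The `span` helper, the in-memory `SPANS` list, and the stubbed retrieval and model steps are all hypothetical names for illustration; a real app would send each record to a backend instead of a list.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical in-memory sink; a real app would ship spans to a backend.
SPANS = []
_current_span = ContextVar("current_span", default=None)


@contextmanager
def span(name, **attributes):
    """Record one timed operation and link it to the enclosing span, if any."""
    parent = _current_span.get()
    record = {
        "span_id": uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        # Every span in a request shares the root span's trace_id.
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "name": name,
        "attributes": dict(attributes),
        "status": "ok",
    }
    token = _current_span.set(record)
    started = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - started) * 1000)
        _current_span.reset(token)
        SPANS.append(record)  # children finish first, so they are appended first


def handle_request(question):
    with span("user_request", question=question):
        with span("retrieval") as s:
            s["attributes"]["documents"] = 3  # stand-in for a real search
        with span("model_call", model="example-model") as s:
            s["attributes"]["output_tokens"] = 42  # stand-in for provider usage
        return "stub answer"


handle_request("What is a span?")
for s in SPANS:
    print(s["name"], "root" if s["parent_id"] is None else "child")
```

The `parent_id` links are what let a backend redraw the tree, and the shared `trace_id` is what lets it find every span belonging to one request.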

What to record on every call

The fields below are the bread and butter of LLM observability. Capture them on every model call and you can answer most production questions without writing new code.

| Field | What it is | Why you need it |
| --- | --- | --- |
| model | Exact model id called | Track behavior and cost per model; spot accidental upgrades |
| prompt / messages | Full input sent | Replay and debug the actual prompt, not what you think you sent |
| completion | Model's raw answer | Inspect bad outputs; feed quality scoring |
| input_tokens / output_tokens | Token counts | Cost attribution and context-bloat detection |
| latency_ms | Wall-clock time | Find slow steps; set alerts on p95 |
| temperature / params | Sampling settings | Reproduce nondeterministic answers |
| trace_id / parent_id | Position in the tree | Stitch spans into one request |
| user_id / session | Who and which conversation | Per-user debugging and abuse detection |
| error / status | Failure detail | Distinguish refusals, timeouts, rate limits |
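One way to keep these fields consistent across your codebase is to give the record a fixed shape. The sketch below assumes a hypothetical `LLMCallRecord` dataclass whose field names mirror the list above; the example values are made up.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LLMCallRecord:
    """One record per model call; field names mirror the list above."""
    model: str
    messages: list
    completion: Optional[str]
    input_tokens: int
    output_tokens: int
    latency_ms: int
    params: dict
    trace_id: str
    parent_id: Optional[str]
    user_id: str
    session_id: str
    status: str = "ok"
    error: Optional[str] = None

    def to_json(self) -> str:
        # One structured line per call, ready for any log search tool.
        return json.dumps(asdict(self), sort_keys=True)


record = LLMCallRecord(
    model="example-model",  # made-up id for illustration
    messages=[{"role": "user", "content": "Hello"}],
    completion="Hi there!",
    input_tokens=12,
    output_tokens=5,
    latency_ms=420,
    params={"temperature": 0.7},
    trace_id="trace-123",
    parent_id=None,
    user_id="user-1",
    session_id="session-1",
)
print(record.to_json())
```

A fixed schema means a missing field fails loudly at construction time instead of silently producing a log line you can't chart later.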

Instrumenting a call yourself

You don't need a platform to start. Observability is fundamentally just capturing a record around each call. Here's the smallest useful version in Python — a wrapper that logs prompt, answer, tokens, and latency as structured JSON. Once it's JSON, any log search tool can chart it.

observe.py

```python
import json
import time
import uuid

from anthropic import Anthropic

MODEL = "claude-sonnet-4-5"
client = Anthropic(api_key="sk-...")  # placeholder


def observed_call(prompt: str, user_id: str) -> str:
    """Call the model and emit one structured log line, success or failure."""
    trace_id = str(uuid.uuid4())
    started = time.perf_counter()
    record = {"trace_id": trace_id, "user_id": user_id,
              "model": MODEL, "prompt": prompt}
    try:
        resp = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.content[0].text
        record.update(
            completion=answer,
            input_tokens=resp.usage.input_tokens,    # real counts from the provider
            output_tokens=resp.usage.output_tokens,
            status="ok",
        )
        return answer
    except Exception as e:        # capture failures too
        record.update(status="error", error=str(e))
        raise
    finally:
        # runs on success and on failure, so latency is always recorded
        record["latency_ms"] = round((time.perf_counter() - started) * 1000)
        # one structured line per call; pipe this to your log backend
        print(json.dumps(record))
```

That finally block is the whole trick: it runs whether the call succeeds, fails, or times out, so you never lose the record of an error — which is exactly the case you most want to see. The usage object on the response gives you real token counts straight from the provider, so you aren't estimating cost. (Most provider SDKs, including the Claude API, return a usage block like this.)

Roll-your-own logging is the right way to start, but it stops scaling once you have nested chains, want side-by-side trace views, or need to score outputs. That's when you reach for a real platform — or, better, for the open standard underneath them.

The tool landscape

There are two layers here, and people constantly conflate them. The bottom layer is the standard for how a trace is shaped. The top layer is the platform that stores and visualizes traces. Pick the standard first; it keeps you portable across platforms.

OpenTelemetry (OTel) is the vendor-neutral standard for traces, metrics, and logs across all of software. It now ships GenAI semantic conventions — agreed names for LLM span attributes like the model, token counts, and operation type. Emit OTel spans and you can switch backends without rewriting instrumentation. Libraries like OpenLLMetry auto-instrument popular SDKs to emit exactly these spans.
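As a rough illustration of what "agreed names" means, the sketch below maps a home-grown log record onto GenAI-style attribute keys using a plain dictionary, with no OTel SDK involved. The `to_genai_attributes` helper is hypothetical, and the attribute names reflect the GenAI semantic conventions as of this writing; the conventions are still evolving, so check the current spec before relying on any key.

```python
# Assumed mapping from our own record fields to GenAI semantic-convention keys.
FIELD_TO_ATTRIBUTE = {
    "model": "gen_ai.request.model",
    "temperature": "gen_ai.request.temperature",
    "input_tokens": "gen_ai.usage.input_tokens",
    "output_tokens": "gen_ai.usage.output_tokens",
}


def to_genai_attributes(record: dict, operation: str = "chat") -> dict:
    """Translate a home-grown log record into convention-style span attributes."""
    attributes = {"gen_ai.operation.name": operation}
    for field, attribute in FIELD_TO_ATTRIBUTE.items():
        if field in record:
            attributes[attribute] = record[field]
    return attributes


print(to_genai_attributes(
    {"model": "example-model", "input_tokens": 120, "output_tokens": 30}
))
```

With an OTel SDK installed you would set these keys as attributes on a span rather than building a dictionary, but the naming is the part that buys you portability.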

| Tool | What it is | Notable for |
| --- | --- | --- |
| OpenTelemetry | Open standard, not a product | Portability; the GenAI semantic conventions |
| Langfuse | Open-source LLM platform | Self-hostable tracing, evals, prompt management |
| LangSmith | Hosted platform (LangChain) | Deep tracing, dataset + eval tooling |
| Arize Phoenix | Open-source, OTel-native | Local-first tracing and evaluation |
| Datadog / Sentry | General APM, now with LLM views | If you already run them for the rest of your stack |

Most of these platforms also bolt on evaluation — scoring live traffic for quality, not just recording it. That overlaps with LLM evals and pairs naturally with guardrails on the input/output. Observability tells you what happened; evals and guardrails tell you whether it was good and safe.

Common pitfalls

  • Logging only on success. The failed and timed-out calls are the ones you most need. Always capture in a finally block so errors get recorded too.
  • Storing prompts in plain text with PII. A debugging goldmine and a compliance time bomb. Redact or hash user content before it leaves your process.
  • Watching status codes instead of content. HTTP 200 with a hallucinated answer is your most common real failure. Track quality, not just transport.
  • No trace id. Without a shared id stitching spans together, a multi-step request becomes a pile of disconnected log lines you can't reassemble.
  • Sampling away the interesting traces. High-traffic apps sample to control cost, but uniform random sampling drops most of the rare expensive and failing calls. Keep every error and every unusually slow or expensive call, plus a random sample of the rest.
  • Treating tokens as free. Not charting input_tokens per feature is how surprise bills happen. It's the single highest-value number to put on a dashboard.
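The sampling pitfall is the easiest one to get wrong in code, so here is a minimal sketch of a keep-or-drop decision. The `should_keep` function and its thresholds are assumptions for illustration; tune them to your own traffic.

```python
import hashlib


def should_keep(record: dict, sample_rate: float = 0.1,
                slow_ms: int = 5000, max_tokens: int = 8000) -> bool:
    """Decide whether to store a call record. Thresholds are illustrative."""
    # Always keep failures: these are the traces you most need.
    if record.get("status") != "ok":
        return True
    # Always keep unusually slow or expensive calls.
    if record.get("latency_ms", 0) >= slow_ms:
        return True
    total_tokens = record.get("input_tokens", 0) + record.get("output_tokens", 0)
    if total_tokens >= max_tokens:
        return True
    # Sample the rest deterministically by trace id, so every span that
    # reaches this step within one trace gets the same decision.
    digest = hashlib.sha256(record["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


print(should_keep({"trace_id": "abc", "status": "error"}))
```

Hashing the trace id instead of calling a random number generator avoids keeping half a trace: the model span and the retrieval span of one request land in the same bucket.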

Going deeper

Once basic tracing is in place, the frontier of LLM observability is turning raw traces into judgments about quality and steering production based on them. A few of the harder problems teams hit at scale:

Online evaluation and scoring

Recording a trace tells you what happened; it doesn't tell you if the answer was good. Online evaluation attaches a quality score to live traffic — sometimes a cheap heuristic (did the output parse as valid JSON? did it contain a banned phrase?), sometimes a second model acting as an LLM-as-a-judge. Running a judge on 100% of traffic is expensive, so teams score a sample, weight toward low-confidence or flagged traces, and surface the worst ones for human review. This is the bridge from passive observability to active quality control.
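The cheap-heuristic end of that spectrum can be a few lines. The sketch below assumes the app expects JSON output and uses a made-up banned-phrase list; `heuristic_checks` and `needs_review` are hypothetical helper names.

```python
import json

# Hypothetical phrase list; replace with whatever your product must never say.
BANNED_PHRASES = ("as an ai language model",)


def heuristic_checks(completion: str) -> dict:
    """Run cheap, deterministic quality checks on one model output."""
    try:
        json.loads(completion)
        valid_json = True
    except json.JSONDecodeError:
        valid_json = False
    lowered = completion.lower()
    return {
        "valid_json": valid_json,
        "banned_phrase": any(phrase in lowered for phrase in BANNED_PHRASES),
    }


def needs_review(checks: dict) -> bool:
    """Flag a trace for the (more expensive) judge model or a human."""
    return (not checks["valid_json"]) or checks["banned_phrase"]


checks = heuristic_checks('{"answer": "Paris"}')
print(checks, needs_review(checks))
```

Running checks like these on all traffic costs almost nothing, which lets you spend the judge-model budget only on the traces they flag.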

Drift detection

Your model's behavior can change even when your code doesn't: a provider updates the model behind a version alias, your users start asking different questions, or your retrieved documents go stale. Drift detection watches the distribution of inputs and outputs over time (answer length, refusal rate, embedding distribution of queries) and alerts when today looks different from last week. It's the difference between finding a regression in an hour and finding it after a month of quiet degradation. It is also why teams version their prompts: with a prompt version recorded on every trace, you can quickly rule your own changes in or out. See prompt management.
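A first version of drift detection can be a comparison of two time windows. This is a minimal sketch, assuming each record carries an `output_tokens` count and a hypothetical boolean `refused` flag that your own code sets; the thresholds are illustrative, and empty windows are not handled.

```python
from statistics import mean


def window_stats(records: list) -> dict:
    """Summarize one time window of call records (must be non-empty)."""
    return {
        "mean_output_tokens": mean(r["output_tokens"] for r in records),
        "refusal_rate": mean(1.0 if r["refused"] else 0.0 for r in records),
    }


def drift_alerts(baseline: dict, current: dict,
                 max_relative_change: float = 0.3,
                 max_rate_change: float = 0.05) -> list:
    """Return the names of metrics that moved more than the allowed amount."""
    alerts = []
    base_len = baseline["mean_output_tokens"]
    # Answer length: compare relative change against the baseline.
    if base_len and abs(current["mean_output_tokens"] - base_len) / base_len > max_relative_change:
        alerts.append("mean_output_tokens")
    # Refusal rate: compare absolute change, since the baseline may be zero.
    if abs(current["refusal_rate"] - baseline["refusal_rate"]) > max_rate_change:
        alerts.append("refusal_rate")
    return alerts


last_week = [{"output_tokens": 100, "refused": False} for _ in range(10)]
today = [{"output_tokens": 160, "refused": i < 2} for i in range(10)]
print(drift_alerts(window_stats(last_week), window_stats(today)))
```

Fixed thresholds are crude but easy to reason about; statistical tests on the full distributions are the natural next step once you know which metrics matter.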

Cost attribution and cardinality

Attributing token spend to a user, feature, or customer means tagging every span with those dimensions — but high-cardinality tags (millions of distinct user ids) are exactly what blows up the storage cost of a metrics system. The standard pattern is to keep high-cardinality data on traces (sampled, searchable) and only roll up low-cardinality aggregates into metrics (cheap, always-on). Getting this split right is most of what separates a cheap observability bill from an expensive one.
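The roll-up half of that pattern might look like the sketch below: raw records keep the user id, while the metric is aggregated only by feature. The prices in `PRICE_PER_MTOK` are invented for illustration and the `feature` tag is an assumed field; substitute your provider's real price list.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; not real provider pricing.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def cost_usd(record: dict) -> float:
    """Cost of one call, computed from provider-reported token counts."""
    price = PRICE_PER_MTOK[record["model"]]
    return (record["input_tokens"] * price["input"]
            + record["output_tokens"] * price["output"]) / 1_000_000


def cost_by_feature(records: list) -> dict:
    """Roll spend up by a low-cardinality tag; user ids stay on the traces."""
    totals = defaultdict(float)
    for record in records:
        totals[record["feature"]] += cost_usd(record)
    return dict(totals)


records = [
    {"model": "example-model", "feature": "search", "user_id": "u1",
     "input_tokens": 1_000_000, "output_tokens": 0},
    {"model": "example-model", "feature": "chat", "user_id": "u2",
     "input_tokens": 0, "output_tokens": 1_000_000},
]
print(cost_by_feature(records))
```

A handful of feature names makes a cheap, always-on metric; per-user spend is still answerable, but by querying the sampled traces rather than by adding a user id label to the metric.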

Streaming and the latency that matters

When you stream tokens to the user, total latency is the wrong number to watch. What users feel is time to first token (TTFT): how long before text starts appearing. Good observability records TTFT separately from total duration, because a response that takes eight seconds in total but starts streaming within 300 ms feels responsive, while one that shows nothing for two seconds and then delivers the whole answer at once feels broken, even though it finishes sooner.
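Recording both numbers takes only a thin wrapper around the stream. This is a minimal sketch: `timed_stream` is a hypothetical helper that works on any iterable of text chunks, and the injectable `clock` parameter exists only to make it testable.

```python
import time
from typing import Callable, Iterable, Iterator


def timed_stream(chunks: Iterable[str], timings: dict,
                 clock: Callable[[], float] = time.perf_counter) -> Iterator[str]:
    """Pass chunks through unchanged while recording TTFT and total time."""
    # Generators are lazy, so the clock starts when the consumer begins
    # iterating. If your request is sent earlier, start timing there instead.
    started = clock()
    first_seen = False
    try:
        for chunk in chunks:
            if not first_seen:
                timings["ttft_ms"] = round((clock() - started) * 1000)
                first_seen = True
            yield chunk
    finally:
        # Recorded even if the stream fails or the consumer stops early.
        timings["total_ms"] = round((clock() - started) * 1000)


timings = {}
text = "".join(timed_stream(["Hel", "lo", "!"], timings))
print(text, sorted(timings))
```

In a real app `chunks` would be the text iterator your provider SDK exposes for a streaming response, and `timings` would be merged into the span for that call.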

FAQ

What is the difference between LLM observability and monitoring?

Monitoring watches a fixed set of known metrics (error rate, latency) and alerts when they cross a threshold. Observability is the richer data — full traces, prompts, completions, tokens — that lets you answer new questions you didn't predict, like "why did this specific answer cost so much?", without shipping new code. Monitoring tells you something is wrong; observability tells you why.

What should I log on every LLM call?

At minimum: the model id, the full prompt and completion, input and output token counts, latency, the sampling parameters, a trace id linking spans in one request, a user or session id, and any error or status. Token counts and the trace id are the two fields people most often forget, and the two that matter most: tokens for cost, the trace id for debugging multi-step chains.

Why isn't Datadog or Sentry enough for LLM apps?

Classic APM tools assume deterministic code where a 200 response means success and errors throw exceptions. LLM apps return well-formatted but wrong answers with a 200, produce different output for the same input, and burn tokens that those tools don't track. They've added LLM views, but you still need AI-specific signals — token counts, prompts, completions, and quality scores — on top.

Do I need a paid platform or can I roll my own?

You can start with a simple wrapper that logs structured JSON around each call, and you should — it teaches you what matters. You outgrow it once you have nested chains, want side-by-side trace replay, or need to score outputs. At that point adopt OpenTelemetry's GenAI conventions and point them at an open-source backend like Langfuse or Arize Phoenix, or a hosted one like LangSmith.

What is a trace in LLM observability?

A trace is the full record of one user request, made of nested spans. Each span is a single timed operation — a retrieval, a model call, a tool call — with attributes like tokens and latency attached. A trace id stitches them into a tree so you can expand the whole request and see exactly which step was slow, expensive, or wrong.

How do I track LLM costs with observability?

Record the input and output token counts that the provider returns in its usage block on every call, tag each span with the feature or user it belongs to, then chart tokens per feature over time. Don't estimate from text length — use the real counts. The first chart almost always reveals one prompt sending far more context than the task needs.

Further reading