
What Is Langfuse? Open-Source LLM Observability

After reading, you'll understand what Langfuse is, how it traces and evaluates LLM apps, and why teams pick an open-source, self-hostable observability stack.

Intermediate · 10 min read · Updated 2026-06-14

In plain English

When a normal web request fails, you check the logs. But an LLM app is not one request — it is a chain. A user asks a question, your code rewrites it, retrieves some documents, calls a model, the model asks to run a tool, you run it, you call the model again, and only then does an answer come out. When the final answer is wrong, which of those steps went wrong? Plain logs won't tell you.

[Figure: Langfuse illustration. Source: langfuse.com]

Langfuse is an open-source observability platform built for exactly this. It records every step of every LLM interaction as a connected trace — the prompt, the retrieved context, the model's reply, the tool calls, how many tokens each step used, how long it took, and what it cost. Then it gives you a dashboard to search those traces, score their quality, and manage the prompts that produced them.

Think of a flight recorder — the "black box" on an aircraft. While the plane flies, it quietly captures every instrument reading. When something goes wrong, investigators replay the recording to see the exact sequence that led to the failure. Langfuse is the black box for your AI app: it captures the full flight of every request so that when an answer is bad, you can replay it step by step instead of guessing.

Why it matters

Once an LLM feature reaches real users, three questions arrive almost immediately, and none of them can be answered by reading code: Why did this specific answer go wrong? Is quality getting better or worse over time? Where is all the money going? Langfuse exists to answer those questions with data instead of opinion.

  • Debugging non-determinism. The same prompt can give a great answer one minute and a broken one the next. Without a recorded trace of the exact inputs — the retrieved chunks, the system prompt, the tool outputs — you cannot reproduce the failure, so you cannot fix it.
  • Cost and latency visibility. Token usage is invisible until you measure it. A trace shows the token count and cost of every single call, so you can find the one retrieval step that quietly stuffs 40,000 tokens into the context on every request — see LLM cost attribution.
  • Quality measurement. "It feels better" is not a metric. Langfuse attaches scores to traces — from automated evals, from an LLM-as-a-judge, or from real user thumbs-up/down — so you can track whether a prompt change actually helped or just looked good in a demo.
  • No vendor lock-in. Because Langfuse is open-source and self-hostable, your trace data — which often contains sensitive prompts and user content — can stay on your own infrastructure, inside your own compliance boundary.
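To make the cost-visibility point concrete, here is a minimal sketch of per-step cost attribution. It assumes each recorded step carries input and output token counts. The prices, the `StepUsage` type, and the step names are illustrative assumptions, not real model pricing or Langfuse's own schema.

```python
from dataclasses import dataclass

# Illustrative prices in USD per million tokens; NOT real model pricing.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


@dataclass
class StepUsage:
    """Token usage for one recorded step (hypothetical shape)."""
    name: str
    input_tokens: int
    output_tokens: int


def step_cost(step: StepUsage) -> float:
    """Cost of one step in USD under the assumed prices."""
    return (step.input_tokens * PRICE_PER_M_INPUT
            + step.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def most_expensive(steps: list[StepUsage]) -> StepUsage:
    """Return the step that contributes the most cost."""
    return max(steps, key=step_cost)


steps = [
    StepUsage("rewrite_query", input_tokens=200, output_tokens=50),
    StepUsage("answer_with_context", input_tokens=40_000, output_tokens=300),
]
worst = most_expensive(steps)
print(worst.name, round(step_cost(worst), 4))
```

With per-step numbers like these, the step that stuffs the context stands out immediately instead of hiding inside a monthly bill.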

Who cares? Any team running LLM features past the prototype stage. The moment you have paying users, a support ticket that says "the bot gave me a wrong refund amount" needs an answer, and that answer lives in a trace. The alternative — sprinkling print statements through an agent loop and hoping — does not scale past a handful of requests.

How it works

Langfuse has two halves: your app sends telemetry (traces) to a Langfuse backend, and the Langfuse UI lets you read, search, score, and analyze that telemetry. You add a small SDK or wrap your existing model client, and from then on every call that passes through the instrumented code is captured without further plumbing.

Traces and observations: the core data model

The whole system is built on one nesting idea. A trace is one complete request, from the user's question all the way to the final answer. Inside a trace are observations, the individual steps. Langfuse's three core kinds of observation are a span (any unit of work, like a retrieval step), a generation (specifically a model call, which also records the prompt, completion, token counts, and cost), and an event (a single point-in-time marker). Observations nest inside each other, so a span can contain a generation that contains another span, mirroring how your code is structured.

Two more groupings sit on top. A session ties many traces together into one conversation, so you can replay a whole multi-turn chat. Scores attach a quality number or label to a trace or observation — that is how evals, human feedback, and LLM-judge results get recorded against the run they belong to.
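The nesting described above can be sketched with a few plain dataclasses. This is a simplified mental model, not Langfuse's actual schema: the class and field names (`Observation`, `Trace`, `Score`, `total_tokens`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    name: str
    kind: str  # "span", "generation", or "event"
    children: list["Observation"] = field(default_factory=list)
    total_tokens: int = 0  # only meaningful for generations


@dataclass
class Score:
    name: str
    value: float


@dataclass
class Trace:
    id: str
    session_id: Optional[str]  # traces sharing a session form one conversation
    observations: list[Observation] = field(default_factory=list)
    scores: list[Score] = field(default_factory=list)


def walk(observations, depth=0):
    """Yield (depth, observation) pairs in tree order."""
    for obs in observations:
        yield depth, obs
        yield from walk(obs.children, depth + 1)


def trace_tokens(trace: Trace) -> int:
    """Sum token counts over every observation in the trace."""
    return sum(obs.total_tokens for _, obs in walk(trace.observations))


trace = Trace(
    id="t1",
    session_id="chat-42",
    observations=[
        Observation("retrieve", "span"),
        Observation("agent_step", "span", children=[
            Observation("answer", "generation", total_tokens=1200),
        ]),
    ],
    scores=[Score("helpful", 1.0)],
)
for depth, obs in walk(trace.observations):
    print("  " * depth + f"{obs.kind}: {obs.name}")
```

Because the structure is a tree, questions like "how many tokens did this whole request use?" reduce to a simple traversal.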

OpenTelemetry-native: why that matters

Langfuse speaks OpenTelemetry (OTel), the open industry standard for traces, metrics, and logs. In practice this means you are not locked into one proprietary SDK: any tool or framework that can emit OTel-style spans can send data to Langfuse, and the traces it stores follow an open format rather than a closed one. It also means LLM traces can sit alongside the rest of your application traces in a familiar shape, instead of in a separate silo.
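One way to see why an open span format matters: OTel-style spans are flat records that point at their parent, and any backend can rebuild the tree from them. The sketch below uses heavily simplified records (real OTel spans also carry timestamps, status, attributes, and binary IDs), and the span names are made up for illustration.

```python
from collections import defaultdict

# Simplified, OTel-inspired span records; not the real wire format.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "name": "answer_question"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "name": "retrieve"},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "a", "name": "llm_call"},
]


def build_tree(spans):
    """Rebuild an indented tree view from flat parent-pointer records."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_span_id"]].append(span)

    def render(parent_id, depth):
        lines = []
        for span in children[parent_id]:
            lines.append("  " * depth + span["name"])
            lines.extend(render(span["span_id"], depth + 1))
        return lines

    return render(None, 0)


print("\n".join(build_tree(spans)))
```

Any framework that emits records of this general shape can feed the same viewer, which is the practical meaning of not being tied to one SDK.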

Instrumenting your code

You rarely write trace plumbing by hand. The common pattern is a decorator that wraps a function, plus wrapped model clients that auto-capture each call. The example below sketches the idea — a top-level traced function with a nested model call inside it.

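Instrumenting a request with Langfuse (Python):

```python
from langfuse import observe, get_client

# Reads credentials from the LANGFUSE_* environment variables.
langfuse = get_client()


def vector_search(question: str, k: int) -> list[str]:
    """Hypothetical placeholder: replace with your vector store lookup."""
    return ["example policy chunk"][:k]


def call_model(question: str, context: str) -> str:
    """Hypothetical placeholder: replace with a wrapped model client,
    which is what would log the prompt, tokens, and cost."""
    return "example answer"


@observe()  # a top-level call to this function is recorded as one trace
def answer_question(question: str) -> str:
    # A nested span: retrieval is recorded as its own step.
    with langfuse.start_as_current_span(name="retrieve") as span:
        chunks = vector_search(question, k=3)
        span.update(metadata={"k": 3, "hits": len(chunks)})

    # The model call: with a wrapped client this appears as a generation.
    context = "\n".join(chunks)
    reply = call_model(question, context)

    # Attach a quality score to this trace (e.g. from an eval or user vote).
    langfuse.score_current_trace(name="helpful", value=1)
    return reply
```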

Beyond tracing, Langfuse adds two production features that share the same data. Prompt management stores your prompts as versioned, named objects on the server, so you can edit a prompt and roll it out without a code deploy — and each trace records which prompt version produced it. Evaluations run scorers (code-based or LLM-as-a-judge) over your traces or over a saved golden dataset, writing the results back as scores you can chart over time.
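The prompt-management idea can be illustrated with a tiny in-memory registry. This is a stand-in for a server-side prompt store, not the Langfuse API: `PromptRegistry`, its methods, and the `production` label convention here are all assumptions made for the sketch.

```python
class PromptRegistry:
    """In-memory stand-in for a versioned, server-side prompt store."""

    def __init__(self):
        self._versions = {}  # name -> list of templates (index = version - 1)
        self._labels = {}    # (name, label) -> version number

    def publish(self, name, template, label=None):
        """Store a new version; optionally point a label at it."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        version = len(versions)
        if label is not None:
            self._labels[(name, label)] = version
        return version

    def get(self, name, label="production"):
        """Return (version, template) for whatever the label points at."""
        version = self._labels[(name, label)]
        return version, self._versions[name][version - 1]


registry = PromptRegistry()
registry.publish("support-answer", "Answer briefly: {question}", label="production")
# A draft version exists but is not live until a label points at it.
registry.publish("support-answer", "Answer step by step: {question}")

version, template = registry.get("support-answer")
prompt = template.format(question="What is the return window?")

# Recording the version next to the trace is what makes later comparisons possible.
trace_metadata = {"prompt_name": "support-answer", "prompt_version": version}
print(trace_metadata, prompt)
```

Moving the label to a new version changes what the app fetches at runtime, which is how a prompt can roll out without a code deploy.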

A worked debugging session

Concretely, here is how a trace turns a vague complaint into a fixed bug. A user reports: "I asked for the return window on a laptop and the bot said 14 days, but our policy is 30." You search Langfuse for that user's session and open the trace.

  1. Open the trace and see the five nested observations: query rewrite, retrieve, tool call, and two generations.
  2. Inspect the retrieve span. It returned three chunks — but all three are about digital goods ("non-refundable, 14-day grace"), none about physical items. The retrieval is wrong, not the model.
  3. Confirm at the generation step. The model's input shows exactly those three off-topic chunks pasted into the context. Given that context, "14 days" was a faithful answer to the wrong documents.
  4. Check the prompt version. The trace records which prompt version produced the call, so you can rule out a recent prompt edit as the cause and compare any fix against the version that actually ran.
  5. Fix the real cause — the chunking or the retrieval query — re-run, and watch the new trace return the correct policy chunk.
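The inspection in the steps above can be sketched in code. The trace here is a hand-written nested dict standing in for an exported trace; its shape, the step names, and the keyword check are assumptions for illustration, not a Langfuse export format.

```python
# Hand-written stand-in for an exported trace (hypothetical shape).
trace = {
    "name": "answer_question",
    "children": [
        {"name": "rewrite_query", "output": "laptop return window"},
        {"name": "retrieve", "output": [
            "Digital goods: non-refundable, 14-day grace.",
            "Digital downloads: 14-day grace period.",
            "Gift cards (digital): non-refundable.",
        ]},
        {"name": "generate_answer", "output": "The return window is 14 days."},
    ],
}


def find(node, name):
    """Depth-first search for the first observation with this name."""
    if node["name"] == name:
        return node
    for child in node.get("children", []):
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


def off_topic_chunks(chunks, required_terms):
    """Chunks that mention none of the terms the question was about."""
    return [c for c in chunks
            if not any(term in c.lower() for term in required_terms)]


retrieve = find(trace, "retrieve")
bad = off_topic_chunks(retrieve["output"], required_terms=("laptop", "physical"))
print(f"{len(bad)} of {len(retrieve['output'])} chunks are off-topic")
```

A crude keyword check like this is only a first filter, but it is enough to show that the retrieval step, not the model, deserves the attention.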

Where Langfuse fits among the tools

Langfuse is one of several LLM observability tools, and they split mainly along who hosts your data and how broad the feature set is. The table below is a conceptual map, not a scoreboard — pick based on your constraints, not a ranking.

| Trait | Langfuse | Hosted-only platforms | Local-first tools |
|---|---|---|---|
| Source model | Open-core (OSS + paid tiers) | Closed / proprietary | Open-source |
| Where data lives | Self-host or their cloud | Their cloud only | Your machine / your infra |
| Standard | OpenTelemetry-native | Often proprietary SDK | Often OTel-native |
| Scope | Tracing + evals + prompts | Tracing + evals | Tracing + evals |
| Best when | You want OSS + a hosted option | You want zero-ops managed | You want no-account local runs |

The decisive feature is the open-core, self-hostable model. Many teams cannot send raw prompts and user content to a third-party SaaS for privacy or compliance reasons; with Langfuse they run the same platform inside their own boundary. Teams without that constraint can use the managed cloud and skip running it themselves. For a head-to-head with the main alternatives, see Langfuse vs LangSmith vs Helicone.

Common pitfalls

  • Tracing nothing useful. A trace with just the final input and output is barely better than a log line. The value is in the intermediate steps — the retrieved context and tool results — so instrument those, not only the top-level call.
  • Logging secrets and PII into traces. Whatever you capture is stored. Prompts and tool outputs often contain personal data or API keys; mask or redact sensitive fields before they reach the backend, and lean on self-hosting when data residency is the concern.
  • Tracing 100% at high volume. Capturing every request gets expensive and noisy at scale. Most teams sample — keep all errors and a fraction of the rest — which is the topic of trace sampling for LLMs.
  • Treating it as logs-only. If you never add scores, you get a searchable archive but no quality signal. Wire in user feedback and at least one automated eval so trends, not anecdotes, drive your decisions.
  • Blocking on the network. Sending telemetry synchronously can add latency to user requests. The SDKs batch and send in the background by default — don't undo that by flushing on every call in a hot path.
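Two of the pitfalls above, redaction and sampling, are easy to sketch with the standard library. The regular expressions are deliberately simple and the `sk-` key shape is an assumption, so treat this as a starting point rather than a complete PII filter; also check the SDK documentation for built-in masking and sampling options before rolling your own.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{8,}\b")  # assumed key shape; adjust to yours


def redact(text: str) -> str:
    """Mask emails and key-like strings before the text leaves the process."""
    text = EMAIL.sub("[EMAIL]", text)
    return API_KEY.sub("[API_KEY]", text)


def should_keep(trace_id: str, is_error: bool, rate: float = 0.1) -> bool:
    """Keep every error; keep a deterministic fraction of everything else."""
    if is_error:
        return True
    # Hashing the ID means the same trace always gets the same decision.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000


print(redact("Contact jo@example.com, key sk-abcdef123456"))
```

Deterministic sampling by trace ID keeps all steps of one request together, so a sampled trace is never missing half its observations.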

Going deeper

Once basic tracing is in place, Langfuse becomes the hub that connects debugging, evaluation, and iteration. A few directions worth knowing as you grow.

Datasets and offline evaluation. You can curate a set of real traces (especially the failures) into a saved dataset, then re-run new prompt or model versions against it and compare scores. This turns ad-hoc "looks better" testing into regression testing for prompts — the difference between offline and online evaluation matters here, since dataset runs test a fixed set while live scores watch real traffic.
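A dataset run is essentially a loop: feed each saved input to the app version under test, score the output, and compare averages. The sketch below uses two toy app functions and an exact-match scorer; all names and the dataset contents are hypothetical.

```python
# A tiny "golden dataset" of inputs and expected answers (illustrative).
dataset = [
    {"input": "Return window for a laptop?", "expected": "30 days"},
    {"input": "Return window for an e-book?", "expected": "14 days"},
]


def app_v1(question: str) -> str:
    """Hypothetical old version: always answers with the digital policy."""
    return "14 days"


def app_v2(question: str) -> str:
    """Hypothetical new version: distinguishes physical items."""
    return "30 days" if "laptop" in question.lower() else "14 days"


def contains_expected(output: str, expected: str) -> float:
    """Code-based scorer: 1.0 if the expected answer appears in the output."""
    return 1.0 if expected in output else 0.0


def run_experiment(app, dataset, scorer) -> float:
    """Mean score of one app version over the whole dataset."""
    scores = [scorer(app(item["input"]), item["expected"]) for item in dataset]
    return sum(scores) / len(scores)


print("v1:", run_experiment(app_v1, dataset, contains_expected))
print("v2:", run_experiment(app_v2, dataset, contains_expected))
```

Because the dataset is fixed, a score change between versions reflects the change you made rather than a shift in traffic.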

Closing the feedback loop. Scores can come from three sources at once: automated code checks, an LLM-as-a-judge, and real user feedback like thumbs-up/down. Funnelling all three onto the same traces lets you ask precise questions — "do the prompts users disliked share a retrieval pattern?" — and answer them by filtering.
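The "filter by score, then look for a shared pattern" question can be sketched as a small aggregation. The trace records below are hand-written and the field names (`scores`, `retrieved_sources`, `user_feedback`) are assumptions for illustration.

```python
from collections import Counter

# Hand-written trace summaries (hypothetical shape).
traces = [
    {"id": "t1", "scores": {"user_feedback": 0}, "retrieved_sources": ["digital-policy"]},
    {"id": "t2", "scores": {"user_feedback": 1}, "retrieved_sources": ["hardware-policy"]},
    {"id": "t3", "scores": {"user_feedback": 0}, "retrieved_sources": ["digital-policy", "faq"]},
]


def sources_for(traces, score_name, value):
    """Count retrieval sources across traces whose score equals `value`."""
    counts = Counter()
    for trace in traces:
        if trace["scores"].get(score_name) == value:
            counts.update(trace["retrieved_sources"])
    return counts


disliked = sources_for(traces, "user_feedback", 0)
print(disliked.most_common(1))
```

Here every disliked answer drew on the same source, which is the kind of pattern that is invisible until scores and retrieval data sit on the same trace.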

Agents and deep nesting. AI agents produce deeply nested traces — loops of think, call tool, observe, repeat. The nested observation tree is what makes an otherwise opaque agent run inspectable: you can see each decision, each tool result, and where a loop went off the rails. This is the point where observability stops being a nicety and becomes the only practical way to debug.
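One common symptom of an agent loop going off the rails is the same tool call repeated with identical arguments. The sketch below flags that pattern in a flattened list of steps; the step shape, tool names, and threshold are assumptions, and a real check would read these steps from recorded observations.

```python
from collections import Counter

# Flattened agent steps in order (hypothetical shape).
steps = [
    {"type": "think"},
    {"type": "tool_call", "tool": "search_orders", "args": "order=123"},
    {"type": "think"},
    {"type": "tool_call", "tool": "search_orders", "args": "order=123"},
    {"type": "think"},
    {"type": "tool_call", "tool": "search_orders", "args": "order=123"},
    {"type": "tool_call", "tool": "get_policy", "args": "item=laptop"},
]


def suspicious_repeats(steps, threshold=3):
    """Return (tool, args) pairs that were called at least `threshold` times."""
    calls = Counter(
        (step["tool"], step["args"])
        for step in steps
        if step["type"] == "tool_call"
    )
    return [call for call, count in calls.items() if count >= threshold]


print(suspicious_repeats(steps))
```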

The honest limits. Observability tells you what happened, not why the model chose it — it cannot open the model's weights. It also adds operational surface: self-hosting means running a backend and its datastore, and the data you collect is itself sensitive and must be secured. And like any measurement system, it only answers questions you instrument for — an untraced step is an invisible one. The durable takeaway: in production, an LLM call you cannot replay is a bug you cannot fix, so capture the full chain first and optimize from there.

FAQ

What is Langfuse used for?

Langfuse is an open-source observability platform for LLM and agent apps. Teams use it to trace every step of a request (prompt, retrieved context, tool calls, model output, tokens, latency, and cost), to score answer quality with evals and user feedback, and to version and manage prompts — all so they can debug bad answers and track quality over time.

Is Langfuse free and open-source?

Largely, yes. Langfuse follows an open-core model: the core platform is open-source and free to self-host, while some advanced features and a managed cloud are offered as paid tiers. Self-hosting is popular because it keeps sensitive prompts and user data inside your own infrastructure.

Do I need LangChain to use Langfuse?

No. Despite the shared "Lang" prefix, Langfuse is framework-agnostic. It works with LangChain, LlamaIndex, plain SDK calls, or any other setup, because it is OpenTelemetry-native and can ingest traces from many sources rather than being tied to one library.

What is the difference between a trace and an observation in Langfuse?

A trace is one complete request from start to finish — the whole user interaction. Observations are the individual steps inside it: spans (units of work like a retrieval), generations (model calls, which also capture tokens and cost), and events (point-in-time markers). Observations nest inside the trace to mirror your code's structure.

Langfuse vs LangSmith — what's the difference?

Both trace and evaluate LLM apps, but Langfuse is open-source and self-hostable (open-core), while LangSmith is a proprietary platform from the LangChain team that is offered mainly as a managed service. The main trade-off is data control and lock-in versus zero-ops convenience. See the dedicated comparison article for a fuller breakdown.

Does Langfuse add latency to my app?

Very little in practice. The SDKs batch telemetry and send it to the backend asynchronously in the background, so capturing traces does not block the user-facing request. You only risk added latency if you force synchronous flushing on every call in a hot path.
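The batching idea behind that answer can be sketched with a queue and a worker thread. This is a generic illustration of background export, not the Langfuse SDK's implementation: `BackgroundExporter` and its methods are hypothetical, and a production exporter would also flush on a timer rather than only when a batch fills.

```python
import queue
import threading


class BackgroundExporter:
    """Queue events on the request path; send them in batches off it."""

    def __init__(self, send, batch_size=10):
        self._send = send              # callable that ships one batch
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, event):
        """Called from the hot path: enqueue and return immediately."""
        self._queue.put(event)

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is None:           # sentinel: stop after draining
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._send(batch)
                batch = []
        if batch:
            self._send(batch)          # ship whatever is left

    def shutdown(self):
        """Flush remaining events and stop the worker."""
        self._queue.put(None)
        self._worker.join()


sent = []
exporter = BackgroundExporter(sent.append, batch_size=2)
for i in range(5):
    exporter.record(i)
exporter.shutdown()
print(sent)
```

Calling a flush after every event would force the request to wait on the network again, which is exactly the pattern to avoid in a hot path.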

Further reading