AI/TLDR

Langfuse

Open-source observability, prompt management, and evals for LLM apps

Overview

Langfuse is an open-source LLM engineering platform that helps teams develop, monitor, evaluate, and debug AI applications. You instrument your app to send traces to Langfuse, then inspect each LLM call along with related steps like retrieval, embedding, and agent actions.

It is built for engineering and data teams running LLM-based features in production who need to see what their models actually do. Beyond tracing, it bundles prompt management with version control, evaluations (including LLM-as-a-judge, code evaluators, and human feedback), datasets for benchmarking, and a playground for iterating on prompts.

In the LLM observability space, Langfuse covers the full development loop rather than just logging. You can run it as a managed cloud service or self-host it on your own infrastructure in a few minutes with Docker Compose, and reach every feature through typed SDKs for Python and JS/TS or the HTTP API.

What it does

  • Tracing for LLM and agent calls, including retrieval, embedding, and other app logic, with session and user inspection
  • Prompt management with central versioning, collaboration, and client- and server-side caching so fetching prompts does not add latency to your application
  • Evaluations via LLM-as-a-judge, code evaluators, user feedback, manual labeling, and custom pipelines through the API/SDK
  • Datasets for test sets and benchmarks, with integrations for frameworks like LangChain and LlamaIndex
  • LLM Playground to test and iterate on prompts and model configurations, reachable directly from a trace
  • Self-hostable via Docker Compose, Kubernetes (Helm), or Terraform templates for AWS, Azure, and GCP
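
To make the prompt-caching bullet above concrete, here is a minimal sketch of client-side caching for versioned prompts. It is not Langfuse's implementation or API: `PromptCache`, the `fetch` callback, and the TTL value are hypothetical names chosen for illustration. The idea is that the application serves a cached prompt until it expires, so most requests never wait on a network fetch.

```python
import time
from typing import Callable, Dict, Tuple


# Hypothetical illustration only: this is not the Langfuse SDK.
class PromptCache:
    """Serve prompts from memory and refetch only after a TTL expires."""

    def __init__(
        self,
        fetch: Callable[[str], Tuple[int, str]],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # returns (version, prompt_text) for a prompt name
        self._ttl = ttl_seconds  # how long a cached prompt stays fresh
        self._clock = clock      # injectable so the cache can be tested without sleeping
        self._entries: Dict[str, Tuple[float, int, str]] = {}

    def get(self, name: str) -> Tuple[int, str]:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1], entry[2]  # cache hit: no fetch, no added latency
        version, text = self._fetch(name)  # cache miss or expired entry
        self._entries[name] = (now, version, text)
        return version, text
```

A new prompt version published on the server would be picked up on the first `get` call after the TTL elapses, which is the trade-off this kind of cache makes between freshness and latency.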

Getting started

You can self-host Langfuse with Docker Compose, then instrument a Python app with the SDK. (A managed cloud option with a free tier is also available.)

Run Langfuse locally with Docker Compose

Clone the repository and start the stack. This brings up the full Langfuse server on your own machine.

bash
# Get a copy of the latest Langfuse repository
git clone --depth=1 https://github.com/langfuse/langfuse.git
cd langfuse

# Run the langfuse docker compose
docker compose up

Install the Python SDK

Add the langfuse package to your application. A JS/TS package is also published on npm as langfuse.

bash
pip install langfuse

Set your API keys

Create a project in the Langfuse UI to get your keys, then set them as environment variables. Point the base URL at your self-hosted instance or at the cloud host.

bash
# Export the variables so child processes (your app) can read them
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="http://localhost:3000"
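
Before starting the app, it can help to confirm the three variables are actually visible to the Python process. The following is a small sketch that is not part of Langfuse: `check_langfuse_env` is a hypothetical helper, and it assumes real keys follow the `pk-lf-` and `sk-lf-` prefixes shown in the placeholders above.

```python
import os
from typing import List, Mapping


def check_langfuse_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems; an empty list means the settings look plausible."""
    problems: List[str] = []
    public_key = env.get("LANGFUSE_PUBLIC_KEY", "")
    secret_key = env.get("LANGFUSE_SECRET_KEY", "")
    host = env.get("LANGFUSE_HOST", "")

    # Assumption: key prefixes match the placeholders in the snippet above.
    if not public_key.startswith("pk-lf-"):
        problems.append("LANGFUSE_PUBLIC_KEY is missing or does not start with 'pk-lf-'")
    if not secret_key.startswith("sk-lf-"):
        problems.append("LANGFUSE_SECRET_KEY is missing or does not start with 'sk-lf-'")
    if not host.startswith(("http://", "https://")):
        problems.append("LANGFUSE_HOST is missing or is not an http(s) URL")
    return problems


if __name__ == "__main__":
    for problem in check_langfuse_env():
        print("config problem:", problem)
```

This only checks the shape of the values; it does not contact the server or verify that the keys are valid.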

Trace your first LLM call

Swap your OpenAI import for the Langfuse wrapper to capture calls automatically, with no other changes to how you write code.

python
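# Drop-in replacement for `import openai`; calls made through it are sent to Langfuse.
# Needs the LANGFUSE_* variables set above and an OPENAI_API_KEY in the environment.
from langfuse.openai import openai

completion = openai.chat.completions.create(
    name="test-chat",  # Langfuse-specific argument: a label for this call in the UI
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a calculator."},
        {"role": "user", "content": "1 + 1 = "},
    ],
)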

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.
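
As a conceptual illustration of what a tracing wrapper captures, the sketch below records the name, input, output, error, and latency of each call. It is not taken from the Langfuse documentation and does not reflect Langfuse internals: `traced`, `TRACE_LOG`, and `fake_llm` are hypothetical names, and the records stay in memory instead of being sent to a server.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory sink; a real tracer would ship records to a backend.
TRACE_LOG: List[Dict[str, Any]] = []


def traced(name: str) -> Callable:
    """Decorator that records input, output, error, and latency of each call."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            record: Dict[str, Any] = {
                "name": name,
                "input": {"args": args, "kwargs": kwargs},
                "output": None,
                "error": None,
            }
            try:
                record["output"] = func(*args, **kwargs)
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)  # keep the failure visible in the trace
                raise
            finally:
                # Runs on success and on failure, so every call leaves a record.
                record["latency_s"] = time.perf_counter() - start
                TRACE_LOG.append(record)

        return wrapper

    return decorator


@traced("test-chat")
def fake_llm(prompt: str) -> str:
    # Stand-in for a model call so the sketch runs offline.
    return "2" if prompt.strip() == "1 + 1 =" else "unknown"
```

Calling `fake_llm("1 + 1 = ")` appends one record to `TRACE_LOG`; an observability platform adds storage, nesting of steps, and a UI on top of this basic idea.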

When to use it

  • Debug a multi-step agent or RAG pipeline by tracing each LLM call, retrieval, and embedding to find where a bad answer came from
  • Manage and version prompts centrally so a team can iterate on them without redeploying the application
  • Run evaluations on production traffic using LLM-as-a-judge, code checks, or human feedback to track output quality over time
  • Build test sets and benchmarks with datasets to compare prompt and model changes before shipping them
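
The evaluation and dataset use cases above can be sketched without any platform at all. The example below is a toy, not Langfuse's evaluation API: `DATASET`, `exact_match`, `run_experiment`, and the two candidate functions are hypothetical, and the "models" are plain Python stand-ins so the code runs offline. It shows a code evaluator scoring two candidates against the same small test set.

```python
from typing import Callable, Dict, List

# Hypothetical test set: each item pairs an input with the expected answer.
DATASET: List[Dict[str, str]] = [
    {"input": "1 + 1 = ", "expected": "2"},
    {"input": "2 + 3 = ", "expected": "5"},
    {"input": "10 - 4 = ", "expected": "6"},
]


def exact_match(output: str, expected: str) -> float:
    """Code evaluator: 1.0 if the trimmed output equals the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(model: Callable[[str], str], dataset: List[Dict[str, str]]) -> float:
    """Return the mean exact-match score of `model` over `dataset`."""
    scores = [exact_match(model(item["input"]), item["expected"]) for item in dataset]
    return sum(scores) / len(scores)


def candidate_a(prompt: str) -> str:
    # Stand-in "model" that evaluates simple "a op b =" arithmetic.
    left, op, right = prompt.replace("=", "").split()
    a, b = int(left), int(right)
    return str(a + b if op == "+" else a - b)


def candidate_b(prompt: str) -> str:
    # Stand-in "model" that always answers 2, to show a lower score.
    return "2"


if __name__ == "__main__":
    print("candidate_a:", run_experiment(candidate_a, DATASET))
    print("candidate_b:", run_experiment(candidate_b, DATASET))
```

Comparing the two scores on a fixed dataset is the same pattern used to judge a prompt or model change before shipping it, only with real model calls and richer evaluators.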

How Langfuse compares

Langfuse alongside other open-source observability and LLMOps tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Langfuse | ★ 29.4k | Open-source observability, prompt management, and evals for LLM apps
Opik | ★ 19.7k | An open-source platform from Comet for tracing, evaluating, and monitoring LLM applications, RAG systems, and agent workflows with dashboards and LLM-as-judge metrics.
TensorZero | ★ 11.7k | An open-source LLMOps platform that puts a single gateway in front of every major LLM provider and adds observability, evaluation, optimization, and A/B testing.
Evidently | ★ 7.6k | A monitoring and evaluation framework for ML and LLM systems that tracks output quality, drift, and test results over time with reports and dashboards.
OpenLLMetry | ★ 7.2k | An OpenTelemetry-based SDK that auto-instruments LLM providers, vector databases, and frameworks so traces flow into any existing observability backend.
Helicone | ★ 5.8k | A proxy-based observability platform that logs, monitors, and evaluates LLM API calls by routing requests through its endpoint with one line of code.
AgentOps | ★ 5.6k | An SDK for monitoring AI agents that tracks LLM cost, session replays, and performance across frameworks like CrewAI, LangChain, and the OpenAI Agents SDK.
Pydantic Logfire | ★ 4.3k | An observability platform from the Pydantic team that records LLM calls, agent runs, and tool invocations with tokens, cost, and latency attached.