Langfuse

Open-source observability, prompt management, and evals for LLM apps

github.com/langfuse/langfuse · ★ 29.4k · langfuse.com

Overview

Langfuse is an open-source LLM engineering platform that helps teams develop, monitor, evaluate, and debug AI applications. You instrument your app to send traces to Langfuse, then inspect each LLM call along with related steps like retrieval, embedding, and agent actions.

It is built for engineering and data teams running LLM-based features in production who need to see what their models actually do. Beyond tracing, it bundles prompt management with version control, evaluations (including LLM-as-a-judge, code evaluators, and human feedback), datasets for benchmarking, and a playground for iterating on prompts.

In the LLM observability space, Langfuse covers the full development loop rather than just logging. You can run it as a managed cloud service or self-host it on your own infrastructure in a few minutes with Docker Compose, and reach every feature through typed SDKs for Python and JS/TS or the HTTP API.

What it does

Tracing for LLM and agent calls, including retrieval, embedding, and other app logic, with session and user inspection
Prompt management with central versioning, collaboration, and client- and server-side caching so iteration does not add latency
Evaluations via LLM-as-a-judge, code evaluators, user feedback, manual labeling, and custom pipelines through the API/SDK
Datasets for test sets and benchmarks, with integrations for frameworks like LangChain and LlamaIndex
LLM Playground to test and iterate on prompts and model configurations, reachable directly from a trace
Self-hostable via Docker Compose, Kubernetes (Helm), or Terraform templates for AWS, Azure, and GCP
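
The client-side caching mentioned above can be pictured as a small time-to-live cache in front of the prompt store. The following is a minimal sketch of that idea only, not the Langfuse SDK: `PromptCache`, `fetch_from_server`, and the 60-second TTL are hypothetical names and values chosen for illustration.

```python
import time


class PromptCache:
    """Hypothetical client-side cache that refetches a prompt after a TTL."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch        # callable: prompt name -> prompt text
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests can control time
        self._entries = {}         # prompt name -> (expires_at, prompt text)

    def get(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[0] > now:
            return entry[1]        # still fresh: no round trip needed
        prompt = self._fetch(name) # missing or expired: fetch and store
        self._entries[name] = (now + self._ttl, prompt)
        return prompt


calls = []


def fetch_from_server(name):
    """Stand-in for a network fetch; records each call so hits are visible."""
    calls.append(name)
    return f"prompt body for {name}"


cache = PromptCache(fetch_from_server, ttl_seconds=60.0)
first = cache.get("calculator")
second = cache.get("calculator")
print(len(calls), first == second)
```

Only the first read reaches the stand-in server. The second is served from memory, which is why cached prompt lookups need not add latency to a request.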

Getting started

You can self-host Langfuse with Docker Compose, then instrument a Python app with the SDK. (A managed cloud option with a free tier is also available.)

Run Langfuse locally with Docker Compose

Clone the repository and start the stack. This brings up the full Langfuse server on your own machine.

bash

# Get a copy of the latest Langfuse repository
git clone --depth=1 https://github.com/langfuse/langfuse.git
cd langfuse

# Run the langfuse docker compose
docker compose up
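
Once the stack is running, it can help to confirm that the server responds before moving on. This is a minimal sketch using only the Python standard library. It assumes the instance listens on http://localhost:3000 (the address used in the API keys step) and exposes a health endpoint at /api/public/health. Treat both as assumptions and check them against the official documentation for your version.

```python
import urllib.error
import urllib.request

# Assumption: the local stack serves HTTP on port 3000.
BASE_URL = "http://localhost:3000"
# Assumption: health endpoint path; verify against the official docs.
HEALTH_PATH = "/api/public/health"


def health_url(base_url: str) -> str:
    """Join the base URL and the health path without doubling slashes."""
    return base_url.rstrip("/") + HEALTH_PATH


def is_up(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


if __name__ == "__main__":
    print("Langfuse reachable:", is_up(BASE_URL))
```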

Install the Python SDK

Add the langfuse package to your application. A JS/TS package is also published on npm as langfuse.

bash

pip install langfuse

Set your API keys

Create a project in the Langfuse UI to get your keys, then set them as environment variables. Point the base URL at your self-hosted instance or at the cloud host.

bash

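# export makes the values visible to the Python process started from this shell
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="http://localhost:3000"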
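
A missing or empty variable is an easy mistake to make at this step, so a small check at application startup can save debugging time. This sketch uses only the three variable names listed above; the helper `missing_langfuse_vars` is a hypothetical name, not part of the SDK.

```python
import os

# The three variables named in this step.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_langfuse_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_langfuse_vars()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All Langfuse variables are set.")
```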

Trace your first LLM call

Swap your OpenAI import for the Langfuse wrapper to capture calls automatically, with no other changes to how you write code.

python

from langfuse.openai import openai

completion = openai.chat.completions.create(
    name="test-chat",
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a calculator."},
        {"role": "user", "content": "1 + 1 = "},
    ],
)
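
To make concrete what "capturing a call" means, the sketch below records the name, input, output, and duration of a function call in memory. It is a conceptual stand-in that runs offline, not the Langfuse wrapper or its data model: `record_call`, `TRACES`, and `fake_llm` are hypothetical names invented for this illustration.

```python
import functools
import time

# Hypothetical in-memory store; a real setup would send these to Langfuse.
TRACES = []


def record_call(name):
    """Hypothetical decorator that records one entry per function call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = func(*args, **kwargs)
            TRACES.append({
                "name": name,
                "input": {"args": args, "kwargs": kwargs},
                "output": output,
                "seconds": time.perf_counter() - start,
            })
            return output
        return wrapper
    return decorator


@record_call("test-chat")
def fake_llm(prompt):
    """Stand-in for a model call so the sketch runs without network access."""
    return "2" if prompt.strip() == "1 + 1 =" else "unknown"


answer = fake_llm("1 + 1 = ")
print(answer, TRACES[0]["name"], TRACES[0]["input"]["args"])
```

The real wrapper does this kind of bookkeeping for you and ships the result to the Langfuse server, which is why the import swap is the only visible change in application code.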

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Debug a multi-step agent or RAG pipeline by tracing each LLM call, retrieval, and embedding to find where a bad answer came from
Manage and version prompts centrally so a team can iterate on them without redeploying the application
Run evaluations on production traffic using LLM-as-a-judge, code checks, or human feedback to track output quality over time
Build test sets and benchmarks with datasets to compare prompt and model changes before shipping them
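
The dataset and code-check use cases above boil down to scoring each variant on a fixed test set and comparing the averages. The following is a minimal offline sketch of that loop, not the Langfuse datasets or evaluation API: `DATASET`, `exact_match`, `run_benchmark`, and the two candidate functions are hypothetical stand-ins.

```python
# Hypothetical test set: each item pairs an input with the expected output.
DATASET = [
    {"input": "1 + 1 = ", "expected": "2"},
    {"input": "2 + 3 = ", "expected": "5"},
    {"input": "7 - 4 = ", "expected": "3"},
]


def exact_match(output, expected):
    """Code evaluator: 1.0 when the trimmed strings match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_benchmark(app, dataset):
    """Score `app` on every item and return the mean score."""
    scores = [exact_match(app(item["input"]), item["expected"]) for item in dataset]
    return sum(scores) / len(scores)


def candidate_a(prompt):
    """Stand-in for one variant: actually evaluates the arithmetic."""
    left, op, right = prompt.replace("=", "").split()
    a, b = int(left), int(right)
    return str(a + b if op == "+" else a - b)


def candidate_b(prompt):
    """Stand-in for a weaker variant: always answers "2"."""
    return "2"


for name, app in [("candidate_a", candidate_a), ("candidate_b", candidate_b)]:
    print(name, round(run_benchmark(app, DATASET), 2))
```

Comparing the two means before shipping is the whole point of the exercise. In practice the candidates would be real prompt or model variants and the scores would be stored alongside their traces.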

How Langfuse compares

Langfuse alongside other open-source observability and LLMOps tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Langfuse	★ 29.4k	Open-source observability, prompt management, and evals for LLM apps
Opik	★ 19.7k	An open-source platform from Comet for tracing, evaluating, and monitoring LLM applications, RAG systems, and agent workflows with dashboards and LLM-as-judge metrics.
TensorZero	★ 11.7k	An open-source LLMOps platform that puts a single gateway in front of every major LLM provider and adds observability, evaluation, optimization, and A/B testing.
Evidently	★ 7.6k	A monitoring and evaluation framework for ML and LLM systems that tracks output quality, drift, and test results over time with reports and dashboards.
OpenLLMetry	★ 7.2k	An OpenTelemetry-based SDK that auto-instruments LLM providers, vector databases, and frameworks so traces flow into any existing observability backend.
Helicone	★ 5.8k	A proxy-based observability platform that logs, monitors, and evaluates LLM API calls by routing requests through its endpoint with one line of code.
AgentOps	★ 5.6k	An SDK for monitoring AI agents that tracks LLM cost, session replays, and performance across frameworks like CrewAI, LangChain, and the OpenAI Agents SDK.
Pydantic Logfire	★ 4.3k	An observability platform from the Pydantic team that records LLM calls, agent runs, and tool invocations with tokens, cost, and latency attached.
