In plain English
Every time your LLM app calls a model, several things can go wrong: the prompt may be malformed, the retrieved context may be irrelevant, the model may hallucinate, the response may be slow, or costs may spiral. LLM observability tools record what happened on every call — inputs, outputs, token counts, latency, cost — so you can debug failures and track quality over time.
Langfuse, LangSmith, and Helicone are the three tools teams reach for first. They all solve the same core problem — visibility into LLM calls — but they approach it differently, target different workflows, and charge differently. Picking the wrong one leads to either a vendor lock-in headache or a half-instrumented app that still can't tell you why a response went bad.
A useful analogy: think of an LLM app as a restaurant kitchen. Helicone is the ticket printer by the pass — it captures every order (request) and every plate (response) as it goes by, with minimal setup. LangSmith is the head chef's clipboard, tightly integrated with the LangChain cooking workflow, tracking each prep step and plating decision. Langfuse is a full kitchen management system — open-source, self-hostable, and covering tracing, prompt management, and graded evaluations across any recipe (framework).
Why it matters
Without an observability tool you are flying blind. A slow response in production could come from a bloated system prompt, a slow vector-DB query, or a model that started generating extra tokens — a flat log line won't tell you which. A quality regression after a prompt change is invisible until users complain, unless you have scores attached to every trace.
- Debugging production failures. Trace trees show exactly which step in a multi-hop pipeline produced the bad output — retrieval, the model call, or post-processing.
- Cost attribution. Token spend per endpoint, per user, or per feature is invisible without per-call instrumentation. Observability tools surface this automatically.
- Quality tracking over time. Automated evaluation scores attached to traces let you detect prompt regressions the moment a new version ships, not a week later.
- Prompt management. All three tools offer some form of prompt versioning so you can A/B test prompt changes without deploying new code.
- Compliance and auditing. Self-hosted deployments keep sensitive prompt data — which often contains PII — off third-party servers.
How each tool works
The three tools use fundamentally different integration architectures, which is the most important thing to understand before choosing one.
Helicone
- Proxy-based: change one base URL
- No SDK required: works with any HTTP client
- Logs requests and responses at the edge
- Adds ~50-80 ms latency via proxy
- Multi-provider: 100+ models via one endpoint

LangSmith
- SDK-based: set two environment variables
- Deep auto-instrumentation for LangChain
- Uses internal RunTree data model
- OpenTelemetry support added March 2025
- Closed-source SaaS, enterprise self-host

Langfuse
- SDK-based or OpenTelemetry OTLP
- Framework-agnostic: 80+ integrations
- Span/trace model with @observe() decorator
- Open-source (MIT), self-host free
- Also accepts OTel from any language
Helicone: the proxy model
Helicone sits in front of your LLM provider as an HTTP proxy. You replace https://api.openai.com with https://oai.helicone.ai and add your Helicone API key as a header. Every request and response passes through Helicone's infrastructure (built on Cloudflare Workers, ClickHouse, and Kafka), which logs the pair and returns it to your app unchanged. Because it operates at the HTTP level, Helicone works with any language or framework without installing a special SDK.
```python
# Before: direct OpenAI call
from openai import OpenAI

client = OpenAI(api_key="sk-...")

# After: route through the Helicone proxy (new base URL plus one auth header)
client = OpenAI(
    api_key="sk-...",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer hc-..."},
)
```

LangSmith: the LangChain-native model
LangSmith integrates by reading two environment variables — LANGCHAIN_API_KEY and LANGCHAIN_TRACING_V2=true. Once set, every LangChain and LangGraph call is traced automatically with no code changes. Each step becomes a run in LangSmith's internal RunTree model, which mirrors LangChain's execution graph exactly. For non-LangChain code you can use the @traceable decorator to create custom runs.
```python
import os

# Set the tracing variables before any LangChain code runs
os.environ["LANGCHAIN_API_KEY"] = "ls-..."
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-project"

# All LangChain calls below are now traced automatically
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke([HumanMessage(content="Explain attention")])
```

Langfuse: the span-tree model
Langfuse uses a span/trace model aligned with OpenTelemetry. The @observe() decorator wraps any Python function and creates a named span automatically, with parent-child nesting inferred from the call stack. You can also import Langfuse's OpenAI drop-in wrapper for zero-code auto-instrumentation of model calls, or send raw OTel spans via OTLP if you prefer vendor-neutral instrumentation.
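```python
import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-..."

# Import path as of the v2 Python SDK; later SDK versions may expose
# the decorator as `from langfuse import observe`.
from langfuse.decorators import observe
from langfuse.openai import openai  # drop-in wrapper around the OpenAI SDK

@observe(name="retrieve")
def retrieve(query: str) -> list[str]:
    # ... vector DB call ...
    return []

@observe(name="llm_call")
def call_model(docs: list[str], query: str) -> str:
    context = "\n".join(docs)
    completion = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{context}\n\n{query}"}],
    )
    # Return the text so the type matches answer()'s annotation
    return completion.choices[0].message.content

@observe()  # creates a trace for the whole pipeline
def answer(query: str) -> str:
    docs = retrieve(query)          # nested span
    return call_model(docs, query)  # nested span
```

Feature-by-feature comparison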
The tools converge on basic tracing but diverge sharply on evaluations, prompt management, and data ownership. Here is how they stack up on the features that matter most when building production LLM apps.
| Feature | Langfuse | LangSmith | Helicone |
|---|---|---|---|
| Open-source license | MIT (full core) | Closed-source | Apache 2.0 |
| Self-hosting | Free, first-class | Enterprise license required | Free (Apache 2.0) |
| Integration model | SDK + OTel OTLP | SDK (env vars) | Proxy (one URL change) |
| Framework support | 80+ frameworks, any LLM | LangChain/LangGraph native; others via SDK | Any HTTP client |
| Span-level tracing | Yes — full tree | Yes — RunTree model | Request/response pairs only |
| Multi-step agent tracing | Yes, deep nesting | Yes, deep nesting for LangGraph | Sessions (stitched after the fact) |
| Built-in evaluations | LLM-as-judge, heuristics, human annotation | LLM-as-judge, built-in evaluators, CI/CD | Basic only (no LLM-as-judge) |
| Prompt management | Yes — versioned, A/B, playground | Yes — versioned, playground | Limited |
| Dataset management | Yes | Yes | No |
| Cost tracking | Basic (token counts) | Basic (token counts) | Advanced (multi-provider, auto-calculated) |
| Built-in caching | No | No | Yes — edge caching via Cloudflare |
| Gateway / routing | No | No | Yes — 100+ providers, rate limiting, failover |
| Data retention (free tier) | 30 days | 14 days (base traces) | 7 days |
| Active development | Yes (rapid) | Yes (rapid) | Maintenance mode since March 2026 |
Evaluations: where LangSmith and Langfuse pull ahead
Both Langfuse and LangSmith offer LLM-as-a-judge evaluation pipelines that score live traces on criteria like accuracy, relevance, and faithfulness. LangSmith ships built-in evaluator templates and tight CI/CD integration so you can gate a deployment on eval scores. Langfuse offers human annotation queues alongside automated evals, which is useful when you want domain experts to label ambiguous cases. Helicone offers only basic manual scoring with no LLM-as-judge capability.
Helicone's unique strengths (while active)
Before its maintenance-mode shift, Helicone's gateway layer provided genuine value beyond logging: edge caching (serving repeated identical requests without hitting the model), rate limiting per user or API key, and multi-provider routing through a single unified endpoint. These features still work for existing deployments and are viable for teams that self-host the Apache 2.0 version.
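To make the caching idea concrete, here is a minimal sketch of exact-match request caching, assuming the cache key is a hash of the canonicalized request body. The `RequestCache` class and `fake_model` function are hypothetical names for illustration; this is not Helicone's implementation, which runs at the edge and has its own key and expiry rules.

```python
import hashlib
import json
from typing import Callable


class RequestCache:
    """Toy exact-match cache keyed on a hash of the request body."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(request: dict) -> str:
        # Canonical JSON so that key order does not change the hash.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def fetch(self, request: dict, call_model: Callable[[dict], dict]) -> dict:
        key = self.key_for(request)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # served without calling the model
        self.misses += 1
        response = call_model(request)
        self._store[key] = response
        return response


def fake_model(request: dict) -> dict:
    # Stand-in for a real provider call (hypothetical).
    return {"text": f"echo: {request['messages'][-1]['content']}"}


cache = RequestCache()
req = {"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]}
first = cache.fetch(req, fake_model)
# Same request with its keys in a different order: still a cache hit.
second = cache.fetch(dict(reversed(list(req.items()))), fake_model)
```

Exact-match caching only helps when requests repeat byte-for-byte after canonicalization, so it suits deterministic workloads such as classification or templated prompts more than free-form chat.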
Pricing compared
The three tools use meaningfully different billing models, which matters a lot at scale. Langfuse counts units (each trace, observation, or score is one unit), LangSmith counts traces with per-seat fees layered on top, and Helicone counts requests.
| Tier | Langfuse | LangSmith | Helicone |
|---|---|---|---|
| Free | $0 — 50k units/mo, 2 users, 30-day retention | $0 — 5k traces/mo, 1 seat, 14-day retention | $0 — 10k requests/mo, 1 seat, 7-day retention |
| Entry paid | $29/mo — 100k units, unlimited users, 90-day retention | $39/seat/mo — 10k base traces, unlimited seats | $79/mo — overage billing, unlimited seats, 30-day retention |
| Mid-tier | $199/mo — 100k units, 3-year retention, annotation queues | $39/seat/mo + $2.50/1k traces overage | $799/mo Team — 3-month retention |
| Enterprise | $2,499/mo — SSO, SCIM, audit logs, SLA | Custom — self-hosting option included | Custom |
| Self-host | Free (MIT); infra costs ~$3–4k/mo at scale | Enterprise license required | Free (Apache 2.0) |
| Overage (beyond included) | $8/100k units (volume discounts to $6) | $2.50/1k base traces; $5/1k extended traces | Usage-based, see pricing page |
What 'units' means in Langfuse billing
A Langfuse unit is any of: one trace (a full pipeline run), one observation (a span within a trace), or one score (an evaluation result attached to a trace). A single complex RAG pipeline run might generate one trace, five observations, and two scores — counting as eight units. At high volume, this adds up differently than per-trace pricing, so benchmark your typical trace depth before estimating costs.
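The unit arithmetic above can be sketched in a few lines. `TraceShape` is a hypothetical helper, and the 200,000 runs per month figure is an illustrative assumption rather than a number from Langfuse.

```python
from dataclasses import dataclass


@dataclass
class TraceShape:
    observations: int  # spans inside one trace
    scores: int        # evaluation results attached to the trace

    def units(self) -> int:
        # One unit for the trace itself, plus one per observation and score.
        return 1 + self.observations + self.scores


rag_run = TraceShape(observations=5, scores=2)
units_per_run = rag_run.units()

# Hypothetical volume: 200,000 pipeline runs per month.
monthly_units = units_per_run * 200_000
```

Measuring the real distribution of observations per trace in your own app matters more than the exact formula, since deep agent traces can produce many more units than a single-call endpoint.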
LangSmith's seat-plus-usage model
LangSmith's Plus plan charges $39 per seat per month, then bills overage for traces beyond the 10k monthly base. For a five-engineer team generating 200k traces a month, that is $195 in seat fees plus $475 in trace overage (190k extra traces at $2.50/1k), totaling about $670/month before extended-retention upgrades. Seat fees scale with team size regardless of traffic, and overage scales with traffic regardless of team size, so the bill is dominated by seats for large low-traffic teams and by overage for small high-traffic projects.
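The seat-plus-overage calculation can be written as a small function. This is a sketch using the prices quoted in this article as defaults; it assumes overage is billed linearly per trace and ignores extended-retention traces.

```python
def langsmith_plus_monthly_cost(
    seats: int,
    traces: int,
    seat_price: float = 39.0,       # USD per seat per month (from the table above)
    included_traces: int = 10_000,  # monthly base traces
    overage_per_1k: float = 2.50,   # USD per 1k base traces beyond the included amount
) -> float:
    seat_fees = seats * seat_price
    extra_traces = max(0, traces - included_traces)
    overage = extra_traces / 1_000 * overage_per_1k
    return seat_fees + overage


team_cost = langsmith_plus_monthly_cost(seats=5, traces=200_000)
solo_cost = langsmith_plus_monthly_cost(seats=1, traces=200_000)
big_quiet_team = langsmith_plus_monthly_cost(seats=10, traces=10_000)
```

Comparing the three results shows the two cost drivers separately: the solo project still pays the full overage, while the ten-person team with little traffic pays only seat fees.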
Which one to pick
There is no universally right answer, but the decision tree is fairly tight once you know your stack and constraints.
Choose Langfuse when...
- Data sovereignty matters. You can self-host the entire MIT-licensed stack on your own infrastructure — no data leaves your network. Langfuse supports fully air-gapped deployments and is GDPR/HIPAA-aligned when self-hosted.
- You use multiple frameworks or want to stay portable. Langfuse supports 80+ frameworks (LlamaIndex, CrewAI, AutoGen, raw OpenAI SDK, Anthropic, etc.) and accepts OTel OTLP from any language.
- You need full evals plus human annotation. The annotation queue feature lets domain experts score live traffic alongside automated LLM-as-judge pipelines.
- You want predictable pricing at scale. Unit-based pricing with volume discounts is easier to forecast than per-seat plus per-trace stacking.
Choose LangSmith when...
- Your entire stack is LangChain or LangGraph. Two environment variables and every chain, agent, and retriever is traced with zero code changes. No other tool matches this for LangChain apps.
- You want the best evaluation and CI/CD integration. LangSmith's built-in evaluators, dataset management, and ability to gate deployments on eval scores are the deepest in class.
- You prefer managed SaaS. LangSmith is a fully managed cloud service; you do not need to maintain any infrastructure beyond your own app.
- You use LangGraph Studio. LangSmith is the backend for LangGraph Studio — the visual agent debugger — and the two are tightly integrated.
Consider Helicone (with caveats) when...
- You need observability in the next five minutes. Changing one base URL is genuinely the fastest integration possible — no SDK, no decorators, no code changes.
- You primarily care about cost tracking and caching. Helicone's multi-provider cost dashboard and Cloudflare-edge cache are still best-in-class for raw request/response logging.
- You are already on Helicone in production. The proxy is stable, self-hosting under Apache 2.0 is viable, and there is no announced shutdown.
- Caveat for new projects. Avoid Helicone for greenfield work if you expect to need evaluations, prompt management, or a growing feature set; the maintenance-mode status makes that a risky bet.
Going deeper
Once you have chosen a tool and basic tracing is flowing, the next challenges are instrumentation depth, evaluation pipelines, and long-term data strategy.
OpenTelemetry as an escape hatch
Both Langfuse and LangSmith added OpenTelemetry OTLP ingestion (Langfuse as a long-standing feature; LangSmith in March 2025). If you instrument your app with the standard OTel Python or JavaScript SDK and the GenAI semantic conventions, you can point traces at either backend by changing one environment variable — your instrumentation code has zero vendor imports. This is the most portable setup and protects you from future pricing or product changes.
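As a sketch of what "switching backends by configuration" looks like, the helpers below build the standard OTel exporter environment variables for each backend. The endpoint URLs and header formats are recalled from each vendor's documentation and should be treated as assumptions to verify; the function names are hypothetical. In practice the switch touches two variables (endpoint and headers), not one.

```python
import base64

# Assumed OTLP endpoints; confirm against the current vendor docs.
LANGFUSE_OTLP_ENDPOINT = "https://cloud.langfuse.com/api/public/otel"
LANGSMITH_OTLP_ENDPOINT = "https://api.smith.langchain.com/otel"


def langfuse_otel_env(public_key: str, secret_key: str) -> dict[str, str]:
    # Assumption: Langfuse authenticates OTLP with HTTP Basic auth (public:secret).
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": LANGFUSE_OTLP_ENDPOINT,
        "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
    }


def langsmith_otel_env(api_key: str) -> dict[str, str]:
    # Assumption: LangSmith authenticates OTLP with an x-api-key header.
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": LANGSMITH_OTLP_ENDPOINT,
        "OTEL_EXPORTER_OTLP_HEADERS": f"x-api-key={api_key}",
    }


env = langfuse_otel_env("pk-test", "sk-test")
# To apply it: os.environ.update(env) before the OTel SDK is initialized.
```

Some OTel SDKs expect header values in `OTEL_EXPORTER_OTLP_HEADERS` to be URL-encoded, so check how your language's exporter parses the space in `Basic <token>`.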
Self-hosting Langfuse: what the infrastructure actually looks like
Langfuse's self-hosted stack requires PostgreSQL (transactional data such as users, projects, and prompts), ClickHouse (traces, observations, scores, and the analytics queries over them), Redis (queuing and caching), and S3-compatible object storage (raw events and large payloads). At small scale a single Docker Compose file handles all four. At production scale with millions of traces per month, teams typically move to Kubernetes, with ClickHouse and Redis running as separate managed services. The Langfuse team publishes Helm charts and Terraform modules. Infrastructure cost at mid-market scale (~1–5M traces/month) typically runs $300–800/month on managed cloud services, which is often cheaper than the equivalent Langfuse cloud Pro plan.
Building an eval pipeline on top of tracing
The most powerful pattern is: trace everything in production, then run an LLM-as-a-judge pipeline that scores every trace on criteria you define (faithfulness, groundedness, tone). Both Langfuse and LangSmith support this natively. In Langfuse you configure online evaluations that trigger on new traces; scores are written back to the trace and appear in dashboards. In LangSmith you set up online evaluators that run after every trace is recorded. Either way, you end up with a dataset of scored production runs you can use to build regression test suites.
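The score-and-write-back loop can be sketched without any vendor SDK. Everything here is hypothetical: the `Trace` dataclass stands in for a stored trace, and `overlap_judge` is a crude word-overlap heuristic standing in for a real LLM-as-a-judge call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    trace_id: str
    context: str   # retrieved context shown to the model
    output: str    # what the model answered
    scores: dict[str, float] = field(default_factory=dict)


Judge = Callable[[Trace], float]


def overlap_judge(trace: Trace) -> float:
    """Stand-in for an LLM judge: share of output words found in the context."""
    out_words = trace.output.lower().split()
    ctx_words = set(trace.context.lower().split())
    if not out_words:
        return 0.0
    return sum(word in ctx_words for word in out_words) / len(out_words)


def score_traces(traces: list[Trace], judges: dict[str, Judge]) -> None:
    # Write each judge's score back onto the trace it evaluated.
    for trace in traces:
        for name, judge in judges.items():
            trace.scores[name] = judge(trace)


def regression_candidates(traces: list[Trace], metric: str, threshold: float) -> list[Trace]:
    # Low-scoring production runs are good seeds for a regression test suite.
    return [t for t in traces if t.scores.get(metric, 1.0) < threshold]


traces = [
    Trace("t1", context="paris is the capital of france", output="paris is the capital"),
    Trace("t2", context="paris is the capital of france", output="berlin hosts the parliament"),
]
score_traces(traces, {"groundedness": overlap_judge})
flagged = regression_candidates(traces, "groundedness", threshold=0.5)
```

Swapping `overlap_judge` for a function that prompts a model and parses a numeric verdict keeps the rest of the loop unchanged, which is the main reason to keep judges behind a plain callable interface.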
Prompt management as a separate concern
Deploying new prompts as code deploys is slow and couples prompt iteration to your release cycle. Both Langfuse and LangSmith offer centralized prompt registries where prompts are versioned, tagged (production/staging/dev), and fetched at runtime by your app. This means a product manager or prompt engineer can update a prompt and see the effect on live traces without touching the codebase or waiting for a deploy.
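A toy in-memory registry illustrates the version-plus-label idea. `PromptRegistry` and its methods are hypothetical and do not mirror either vendor's API; a real registry would sit behind a network call with client-side caching.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    _versions: dict[str, list[str]] = field(default_factory=dict)
    _labels: dict[tuple[str, str], int] = field(default_factory=dict)

    def publish(self, name: str, template: str) -> int:
        """Store a new immutable version and return its 1-based number."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        return len(versions)

    def tag(self, name: str, label: str, version: int) -> None:
        """Point a label such as 'production' at an existing version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._labels[(name, label)] = version

    def get(self, name: str, label: str = "production") -> str:
        """Fetch the template the label currently points at."""
        return self._versions[name][self._labels[(name, label)] - 1]


registry = PromptRegistry()
v1 = registry.publish("summarize", "Summarize: {text}")
v2 = registry.publish("summarize", "Summarize in one sentence: {text}")
registry.tag("summarize", "production", v1)
registry.tag("summarize", "staging", v2)

before = registry.get("summarize").format(text="LLM tracing")
registry.tag("summarize", "production", v2)  # promote without a code deploy
after = registry.get("summarize").format(text="LLM tracing")
```

Because the app only ever asks for a label, promoting or rolling back a prompt is a metadata change rather than a release.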
PII and data residency
LLM spans contain prompts, and prompts often contain user messages — which may include PII. Before sending traces to any cloud service, decide whether to: (a) mask PII fields before the span is created, (b) run a redaction filter in an OTel Collector pipeline between your app and the backend, or (c) self-host the entire observability stack so data never leaves your infrastructure. Langfuse's MIT self-hosted deployment is the most complete answer to (c). LangSmith's self-hosted Enterprise option requires an Enterprise license agreement.
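Option (a) can be sketched as a small redaction pass applied to messages before they are handed to any tracing call. The regular expressions below are deliberately simple assumptions that catch common email and phone formats only; production redaction usually needs a dedicated PII detection library and locale-specific rules.

```python
import re

# Simplistic patterns for illustration; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def redact_messages(messages: list[dict]) -> list[dict]:
    # Return copies so the originals still go to the model unmodified.
    return [{**m, "content": redact(m["content"])} for m in messages]


messages = [
    {"role": "user", "content": "Mail jane.doe@example.com or call +1 (555) 010-9999."}
]
safe_for_tracing = redact_messages(messages)
```

The model still receives the original messages; only the copy sent to the observability backend is masked, which keeps answer quality intact while limiting what leaves your infrastructure.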
FAQ
Is Langfuse really free to self-host?
Yes. The core platform — tracing, prompt management, evaluations, annotation queues, datasets, and the playground — is MIT-licensed with no feature restrictions when self-hosted. The only commercially-licensed self-hosted features are enterprise add-ons like SCIM provisioning, audit logs, project-level RBAC, and UI white-labeling. The infrastructure itself (PostgreSQL, ClickHouse, Redis, S3) carries its own cloud costs, but the software license is free.
Can I use LangSmith without LangChain?
Yes, using the @traceable decorator or the RunTree API directly. However, the zero-config automatic tracing that makes LangSmith compelling only works for LangChain and LangGraph. For non-LangChain apps, Langfuse's framework-agnostic approach and OpenTelemetry support typically require less instrumentation effort.
Is Helicone still safe to use in production?
For existing production deployments, yes. Helicone is stable, the Apache 2.0 license allows self-hosting indefinitely, and the team has committed to shipping security patches and new model support. The key risk is for new projects: with no new features planned following the Mintlify acquisition, expect the product to gradually fall behind Langfuse and LangSmith on evaluation and agent-tracing capabilities.
How does Langfuse pricing compare to LangSmith at scale?
At low volumes, Langfuse's free Hobby tier (50k units/month) is more generous than LangSmith's 5k-trace Developer tier, though the two are not measured the same way: a Langfuse trace with several observations and scores consumes several units. At mid-scale (200k traces/month, 5-person team), LangSmith at $39/seat/mo plus trace overages typically runs $500–800/month; Langfuse Pro at $199/month plus $8/100k units for overages is often cheaper, depending on how many units each trace produces. At very high volume (10M+ traces/month), Langfuse's volume discounts bring per-unit costs down significantly, and self-hosting the MIT version eliminates licensing costs entirely.
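The Langfuse side of that comparison depends heavily on trace depth, which a rough estimate makes visible. This sketch uses the Pro figures quoted in this article ($199 base, 100k included units, $8 per extra 100k units) and assumes overage is billed linearly with no volume discount; the units-per-trace values are illustrative.

```python
def langfuse_pro_monthly_cost(
    traces: int,
    units_per_trace: int,
    base_price: float = 199.0,        # USD per month (from the pricing table)
    included_units: int = 100_000,
    overage_per_100k: float = 8.0,    # USD per extra 100k units, before discounts
) -> float:
    units = traces * units_per_trace
    extra_units = max(0, units - included_units)
    return base_price + extra_units / 100_000 * overage_per_100k


shallow = langfuse_pro_monthly_cost(traces=200_000, units_per_trace=8)
deep = langfuse_pro_monthly_cost(traces=200_000, units_per_trace=30)
```

At eight units per trace the estimate lands well under the LangSmith figure for the same traffic, while at thirty units per trace the two are roughly level, so the "often cheaper" claim holds mainly for shallow to moderately deep traces.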
What is the difference between Helicone's proxy model and Langfuse's SDK model?
Helicone intercepts HTTP traffic at the network level — no code change beyond a base URL swap. It records the raw request and response but has no visibility into your application's internal pipeline structure. Langfuse's SDK model instruments your code at the function level, so it captures the parent-child relationship between pipeline steps (retrieval, reranking, LLM call) as a structured trace tree. The proxy approach is faster to set up; the SDK approach gives richer debugging information for complex pipelines.
Can I switch observability tools without rewriting all my instrumentation?
Yes, if you use OpenTelemetry. Both Langfuse and LangSmith accept OTLP traces, meaning you can instrument once with the standard OTel SDK (using the GenAI semantic conventions) and switch backends by changing one environment variable. Helicone's proxy model is inherently portable for simple request logging — swap back the base URL and you are off Helicone. The harder migration is moving away from LangSmith's @traceable decorator or Langfuse's @observe() decorator to OTel, but that is a one-time refactor.