Langfuse vs LangSmith vs Helicone: Observability Tools Compared

Compare the three most popular LLM observability tools and know which one fits your stack, budget, and hosting needs.

Intermediate | 13 min read | Updated 2026-06-12

In plain English

Every time your LLM app calls a model, several things can go wrong: the prompt may be malformed, the retrieved context may be irrelevant, the model may hallucinate, the response may be slow, or costs may spiral. LLM observability tools record what happened on every call — inputs, outputs, token counts, latency, cost — so you can debug failures and track quality over time.

Langfuse, LangSmith, and Helicone are the three tools teams reach for first. They all solve the same core problem — visibility into LLM calls — but they approach it differently, target different workflows, and charge differently. Picking the wrong one leads to either a vendor lock-in headache or a half-instrumented app that still can't tell you why a response went bad.

A useful analogy: think of an LLM app as a restaurant kitchen. Helicone is the ticket printer by the pass — it captures every order (request) and every plate (response) as it goes by, with minimal setup. LangSmith is the head chef's clipboard, tightly integrated with the LangChain cooking workflow, tracking each prep step and plating decision. Langfuse is a full kitchen management system — open-source, self-hostable, and covering tracing, prompt management, and graded evaluations across any recipe (framework).

Why it matters

Without an observability tool you are flying blind. A slow response in production could come from a bloated system prompt, a slow vector-DB query, or a model that started generating extra tokens — a flat log line won't tell you which. A quality regression after a prompt change is invisible until users complain, unless you have scores attached to every trace.

Debugging production failures. Trace trees show exactly which step in a multi-hop pipeline produced the bad output — retrieval, the model call, or post-processing.
Cost attribution. Token spend per endpoint, per user, or per feature is invisible without per-call instrumentation. Observability tools surface this automatically.
Quality tracking over time. Automated evaluation scores attached to traces let you detect prompt regressions the moment a new version ships, not a week later.
Prompt management. All three tools offer some form of prompt versioning so you can A/B test prompt changes without deploying new code.
Compliance and auditing. Self-hosted deployments keep sensitive prompt data — which often contains PII — off third-party servers.

How each tool works

The three tools use fundamentally different integration architectures, which is the most important thing to understand before choosing one.

// Integration architecture

Helicone

Proxy-based: change one base URL
No SDK required — any HTTP client
Logs requests and responses at the edge
Adds ~50-80 ms latency via proxy
Multi-provider: 100+ models via one endpoint

LangSmith

SDK-based: set two environment variables
Deep auto-instrumentation for LangChain
Uses internal RunTree data model
OpenTelemetry support added March 2025
Closed-source SaaS, enterprise self-host

Langfuse

SDK-based or OpenTelemetry OTLP
Framework-agnostic: 80+ integrations
Span/trace model with @observe() decorator
Open-source (MIT), self-host free
Also accepts OTel from any language

Helicone: the proxy model

Helicone sits in front of your LLM provider as an HTTP proxy. You replace https://api.openai.com with https://oai.helicone.ai and add your Helicone API key as a header. Every request and response passes through Helicone's infrastructure (built on Cloudflare Workers, ClickHouse, and Kafka), which logs the pair and returns it to your app unchanged. Because it operates at the HTTP level, Helicone works with any language or framework without installing a special SDK.

python

# Before: direct OpenAI call
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After: route through Helicone proxy (one-line change)
client = OpenAI(
    api_key="sk-...",
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer hc-..."},
)
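Because the proxy works at the HTTP level, the same routing can be done without the OpenAI SDK. The following is a minimal sketch using only Python's standard library. It assumes the proxy mirrors OpenAI's `/chat/completions` path under the base URL shown above; `build_proxied_request` is a hypothetical helper name, and the request is only constructed here, not sent.

```python
import json
import urllib.request

HELICONE_BASE_URL = "https://oai.helicone.ai/v1"  # proxy base URL from the example above


def build_proxied_request(prompt: str, openai_key: str, helicone_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request routed through the proxy."""
    body = json.dumps({
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{HELICONE_BASE_URL}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {openai_key}",    # provider key, forwarded upstream
            "Helicone-Auth": f"Bearer {helicone_key}",  # identifies the Helicone account
        },
    )


request = build_proxied_request("Explain attention", "sk-...", "hc-...")
# To actually send it, pass `request` to urllib.request.urlopen() with real keys.
```

The only Helicone-specific parts are the host name and the extra header, which is why the same pattern carries over to any HTTP client in any language.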

LangSmith: the LangChain-native model

LangSmith integrates by reading two environment variables — LANGCHAIN_API_KEY and LANGCHAIN_TRACING_V2=true. Once set, every LangChain and LangGraph call is traced automatically with no code changes. Each step becomes a run in LangSmith's internal RunTree model, which mirrors LangChain's execution graph exactly. For non-LangChain code you can use the @traceable decorator to create custom runs.

python

import os
os.environ["LANGCHAIN_API_KEY"] = "ls-..."
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "my-project"

# All LangChain calls below are now traced automatically
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke([HumanMessage(content="Explain attention")])

Langfuse: the span-tree model

Langfuse uses a span/trace model aligned with OpenTelemetry. The @observe() decorator wraps any Python function and creates a named span automatically, with parent-child nesting inferred from the call stack. You can also import Langfuse's OpenAI drop-in wrapper for zero-code auto-instrumentation of model calls, or send raw OTel spans via OTLP if you prefer vendor-neutral instrumentation.

python

import os
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-..."

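from langfuse.decorators import observe  # SDK v2 path; SDK v3 uses `from langfuse import observe`
from langfuse.openai import openai  # drop-in wrapper that traces OpenAI calls

@observe(name="retrieve")
def retrieve(query: str) -> list[str]:
    # ... vector DB call ...
    return []

@observe(name="llm_call")
def call_model(docs: list[str], query: str) -> str:
    # Put the retrieved documents into the prompt so the model can use them
    context = "\n".join(docs)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return response.choices[0].message.content  # return text, matching the -> str annotation

@observe()  # creates a trace for the whole pipeline
def answer(query: str) -> str:
    docs = retrieve(query)          # nested span
    return call_model(docs, query)  # nested span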

// Data flow: SDK to backend

Your app (calls LLM, tools, retrievers) → Instrumentation layer (SDK decorator / proxy / OTel SDK) → Telemetry export (async HTTP to observability backend) → Observability backend (Langfuse / LangSmith / Helicone, cloud or self-hosted) → UI + APIs (trace viewer, evals, prompt management, dashboards)

Feature-by-feature comparison

The tools converge on basic tracing but diverge sharply on evaluations, prompt management, and data ownership. Here is how they stack up on the features that matter most when building production LLM apps.

Feature	Langfuse	LangSmith	Helicone
Open-source license	MIT (full core)	Closed-source	Apache 2.0
Self-hosting	Free, first-class	Enterprise license required	Free (Apache 2.0)
Integration model	SDK + OTel OTLP	SDK (env vars)	Proxy (one URL change)
Framework support	80+ frameworks, any LLM	LangChain/LangGraph native; others via SDK	Any HTTP client
Span-level tracing	Yes — full tree	Yes — RunTree model	Request/response pairs only
Multi-step agent tracing	Yes, deep nesting	Yes, deep nesting for LangGraph	Sessions (stitched after the fact)
Built-in evaluations	LLM-as-judge, heuristics, human annotation	LLM-as-judge, built-in evaluators, CI/CD	Basic only (no LLM-as-judge)
Prompt management	Yes — versioned, A/B, playground	Yes — versioned, playground	Limited
Dataset management	Yes	Yes	No
Cost tracking	Basic (token counts)	Basic (token counts)	Advanced (multi-provider, auto-calculated)
Built-in caching	No	No	Yes — edge caching via Cloudflare
Gateway / routing	No	No	Yes — 100+ providers, rate limiting, failover
Data retention (free tier)	30 days	14 days (base traces)	7 days
Active development	Yes (rapid)	Yes (rapid)	Maintenance mode since March 2026

Evaluations: where LangSmith and Langfuse pull ahead

Both Langfuse and LangSmith offer LLM-as-a-judge evaluation pipelines that score live traces on criteria like accuracy, relevance, and faithfulness. LangSmith ships built-in evaluator templates and tight CI/CD integration so you can gate a deployment on eval scores. Langfuse offers human annotation queues alongside automated evals, which is useful when you want domain experts to label ambiguous cases. Helicone offers only basic manual scoring with no LLM-as-judge capability.

Helicone's unique strengths (while active)

Before its maintenance-mode shift, Helicone's gateway layer provided genuine value beyond logging: edge caching (serving repeated identical requests without hitting the model), rate limiting per user or API key, and multi-provider routing through a single unified endpoint. These features still work for existing deployments and are viable for teams that self-host the Apache 2.0 version.
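The caching idea (serve a repeated identical request without calling the model) can be sketched in a few lines. This is a toy in-memory version for illustration, not how Helicone's edge cache is implemented; `RequestCache` and `fake_model` are hypothetical names.

```python
import hashlib
import json


class RequestCache:
    """Toy in-memory cache keyed on a hash of the full request payload."""

    def __init__(self, call_model):
        self._call_model = call_model  # function that actually hits the provider
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(payload: dict) -> str:
        # sort_keys makes logically identical payloads hash identically
        canonical = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def complete(self, payload: dict) -> str:
        key = self._key(payload)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = self._call_model(payload)
        self._store[key] = response
        return response


upstream_calls = []


def fake_model(payload):
    """Stand-in for a real provider call; records that it was invoked."""
    upstream_calls.append(payload)
    return "stubbed response"


cache = RequestCache(fake_model)
payload = {"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]}
cache.complete(payload)
cache.complete(dict(reversed(list(payload.items()))))  # same content, different key order
print(len(upstream_calls), cache.hits)  # 1 1
```

A cache like this only pays off when requests repeat exactly, which is why it suits deterministic prompts (classification, templated lookups) better than free-form chat.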

Pricing compared

The three tools use meaningfully different billing models, which matters a lot at scale. Langfuse counts units (each trace, observation, or score is one unit), LangSmith counts traces with per-seat fees layered on top, and Helicone counts requests.

Tier	Langfuse	LangSmith	Helicone
Free	$0 — 50k units/mo, 2 users, 30-day retention	$0 — 5k traces/mo, 1 seat, 14-day retention	$0 — 10k requests/mo, 1 seat, 7-day retention
Entry paid	$29/mo — 100k units, unlimited users, 90-day retention	$39/seat/mo — 10k base traces, unlimited seats	$79/mo — overage billing, unlimited seats, 30-day retention
Mid-tier	$199/mo — 100k units, 3-year retention, annotation queues	$39/seat/mo + $2.50/1k traces overage	$799/mo Team — 3-month retention
Enterprise	$2,499/mo — SSO, SCIM, audit logs, SLA	Custom — self-hosting option included	Custom
Self-host	Free (MIT); infra costs ~$3–4k/mo at scale	Enterprise license required	Free (Apache 2.0)
Overage (beyond included)	$8/100k units (volume discounts to $6)	$2.50/1k base traces; $5/1k extended traces	Usage-based, see pricing page

What 'units' means in Langfuse billing

A Langfuse unit is any of: one trace (a full pipeline run), one observation (a span within a trace), or one score (an evaluation result attached to a trace). A single complex RAG pipeline run might generate one trace, five observations, and two scores — counting as eight units. At high volume, this adds up differently than per-trace pricing, so benchmark your typical trace depth before estimating costs.
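The unit arithmetic is easy to script when estimating volume. A minimal sketch, assuming every run has the same shape; `langfuse_units` is a hypothetical helper and the monthly run count is made up for illustration.

```python
def langfuse_units(traces: int, observations_per_trace: int, scores_per_trace: int) -> int:
    """Each trace, observation, and score counts as one billable unit."""
    return traces * (1 + observations_per_trace + scores_per_trace)


# One RAG run from the example above: 1 trace + 5 observations + 2 scores
units_per_run = langfuse_units(1, 5, 2)

# Hypothetical volume: 20,000 such runs per month
monthly_units = langfuse_units(20_000, 5, 2)
print(units_per_run, monthly_units)  # 8 160000
```

In this hypothetical, 20,000 runs already exceed a 100k-unit allowance, even though the trace count alone looks small.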

LangSmith's seat-plus-usage model

LangSmith's Plus plan charges $39 per seat per month, then adds overage costs for traces beyond the 10k monthly base. For a five-engineer team generating 200k traces a month, that's $195 in seat fees plus $475 in trace overage (190k extra traces at $2.50/1k), totaling $670/month before extended retention upgrades. Seat fees scale with team size regardless of usage, and overage scales with traffic regardless of team size, so a large team with little traffic pays mostly for seats while a solo developer with heavy traffic pays mostly for overage.
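The seat-plus-usage arithmetic can be reproduced with a small function. This sketch uses only the prices quoted in this article and ignores extended-retention upgrades; `langsmith_plus_monthly_cost` is a hypothetical helper, so check current pricing before relying on the numbers.

```python
def langsmith_plus_monthly_cost(
    seats: int,
    traces: int,
    seat_price: float = 39.0,        # USD per seat per month (from the pricing table)
    included_traces: int = 10_000,   # base traces included per month
    overage_per_1k: float = 2.50,    # USD per 1k base traces beyond the allowance
) -> float:
    """Seat fees plus base-trace overage; ignores extended-retention upgrades."""
    seat_fees = seats * seat_price
    extra_traces = max(0, traces - included_traces)
    overage = extra_traces / 1_000 * overage_per_1k
    return seat_fees + overage


print(langsmith_plus_monthly_cost(seats=5, traces=200_000))  # 670.0
print(langsmith_plus_monthly_cost(seats=1, traces=200_000))  # 514.0
print(langsmith_plus_monthly_cost(seats=5, traces=10_000))   # 195.0
```

Comparing the three calls shows the two cost drivers separately: at 200k traces, overage dominates whatever the team size, while at 10k traces the bill is seat fees alone.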

Which one to pick

There is no universally right answer, but the decision tree is fairly tight once you know your stack and constraints.

// Decision guide

Which LLM observability tool? (start here)

Pick Langfuse: MIT open-source, self-host required, any framework, full evals + prompt mgmt

Pick LangSmith: LangChain / LangGraph stack, managed SaaS, deep eval + CI integration

Pick Helicone (cautiously): zero-SDK proxy for raw LLM calls, cost tracking focus, existing deployment

Choose Langfuse when...

Data sovereignty matters. You can self-host the entire MIT-licensed stack on your own infrastructure — no data leaves your network. Langfuse supports fully air-gapped deployments and is GDPR/HIPAA-aligned when self-hosted.
You use multiple frameworks or want to stay portable. Langfuse supports 80+ frameworks (LlamaIndex, CrewAI, AutoGen, raw OpenAI SDK, Anthropic, etc.) and accepts OTel OTLP from any language.
You need full evals plus human annotation. The annotation queue feature lets domain experts score live traffic alongside automated LLM-as-judge pipelines.
You want predictable pricing at scale. Unit-based pricing with volume discounts is easier to forecast than per-seat plus per-trace stacking.

Choose LangSmith when...

Your entire stack is LangChain or LangGraph. Two environment variables and every chain, agent, and retriever is traced with zero code changes. No other tool matches this for LangChain apps.
You want the best evaluation and CI/CD integration. LangSmith's built-in evaluators, dataset management, and ability to gate deployments on eval scores are the deepest in class.
You prefer managed SaaS. LangSmith is a fully managed cloud service; you do not need to maintain any infrastructure beyond your own app.
You use LangGraph Studio. LangSmith is the backend for LangGraph Studio — the visual agent debugger — and the two are tightly integrated.

Consider Helicone (with caveats) when...

You need observability in the next five minutes. Changing one base URL is genuinely the fastest integration possible — no SDK, no decorators, no code changes.
You primarily care about cost tracking and caching. Helicone's multi-provider cost dashboard and Cloudflare-edge cache are still best-in-class for raw request/response logging.
You are already on Helicone in production. The proxy is stable, self-hosting under Apache 2.0 is viable, and there is no announced shutdown.
Avoid for new greenfield projects if you expect to need evaluations, prompt management, or a growing feature set — the maintenance-mode status makes that a risky bet.

Going deeper

Once you have chosen a tool and basic tracing is flowing, the next challenges are instrumentation depth, evaluation pipelines, and long-term data strategy.

OpenTelemetry as an escape hatch

Both Langfuse and LangSmith accept OpenTelemetry OTLP ingestion (Langfuse as a long-standing feature; LangSmith since March 2025). If you instrument your app with the standard OTel Python or JavaScript SDK and the GenAI semantic conventions, you can point traces at either backend by changing exporter configuration, typically the OTLP endpoint and auth header environment variables, while your instrumentation code has zero vendor imports. This is the most portable setup and protects you from future pricing or product changes.
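With OTel instrumentation, switching backends comes down to exporter configuration. The sketch below builds the two standard OTel exporter environment variables for each backend. The endpoint URLs and header formats are assumptions based on the vendors' cloud offerings and should be verified against their current OTLP documentation; `otlp_env` is a hypothetical helper.

```python
import base64


def otlp_env(backend: str, *, langfuse_public_key: str = "",
             langfuse_secret_key: str = "", langsmith_api_key: str = "") -> dict[str, str]:
    """Return standard OTel exporter environment variables for one backend.

    Endpoints and header formats are assumptions; confirm them in vendor docs.
    """
    if backend == "langfuse":
        # Assumed scheme: HTTP Basic auth with public key as user, secret key as password
        credentials = f"{langfuse_public_key}:{langfuse_secret_key}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        return {
            "OTEL_EXPORTER_OTLP_ENDPOINT": "https://cloud.langfuse.com/api/public/otel",
            "OTEL_EXPORTER_OTLP_HEADERS": f"Authorization=Basic {token}",
        }
    if backend == "langsmith":
        return {
            "OTEL_EXPORTER_OTLP_ENDPOINT": "https://api.smith.langchain.com/otel",
            "OTEL_EXPORTER_OTLP_HEADERS": f"x-api-key={langsmith_api_key}",
        }
    raise ValueError(f"unknown backend: {backend}")


langfuse_env = otlp_env("langfuse", langfuse_public_key="pk", langfuse_secret_key="sk")
langsmith_env = otlp_env("langsmith", langsmith_api_key="ls-...")
```

Both results use the same two variable names, so the application code that creates spans never changes. Some OTel SDKs expect header values to be URL-encoded, so the space after `Basic` may need to be written as `%20`.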

Self-hosting Langfuse: what the infrastructure actually looks like

Langfuse's self-hosted stack requires PostgreSQL (transactional data such as users, projects, and prompts), ClickHouse (traces, observations, and scores, plus analytics queries), Redis (queuing and rate limiting), and S3-compatible object storage (large payloads). At small scale a single Docker Compose file handles all four. At production scale with millions of traces per month you need Kubernetes, with ClickHouse and Redis running as separate managed services. The Langfuse team publishes Helm charts and Terraform modules. Infrastructure cost at mid-market scale (~1–5M traces/month) typically runs $300–800/month on managed cloud services, which is often cheaper than the equivalent Langfuse cloud Pro plan.

Building an eval pipeline on top of tracing

The most powerful pattern is: trace everything in production, then run an LLM-as-a-judge pipeline that scores every trace on criteria you define (faithfulness, groundedness, tone). Both Langfuse and LangSmith support this natively. In Langfuse you configure online evaluations that trigger on new traces; scores are written back to the trace and appear in dashboards. In LangSmith you set up online evaluators that run after every trace is recorded. Either way, you end up with a dataset of scored production runs you can use to build regression test suites.
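Once scored production runs exist, a regression check is mostly aggregation. The sketch below assumes scored traces have been exported and flattened into dictionaries; the field names, the 0.05 tolerance, and both helper functions are hypothetical, not part of either tool's API.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(scored_traces: list[dict]) -> dict[str, float]:
    """Average a judge score per prompt version."""
    buckets = defaultdict(list)
    for trace in scored_traces:
        buckets[trace["prompt_version"]].append(trace["faithfulness"])
    return {version: mean(scores) for version, scores in buckets.items()}


def is_regression(baseline: float, candidate: float, tolerance: float = 0.05) -> bool:
    """Flag the candidate if its mean score drops more than `tolerance` below baseline."""
    return candidate < baseline - tolerance


# Hypothetical scored traces, flattened from an export
scored_traces = [
    {"prompt_version": "v1", "faithfulness": 0.90},
    {"prompt_version": "v1", "faithfulness": 0.80},
    {"prompt_version": "v2", "faithfulness": 0.70},
    {"prompt_version": "v2", "faithfulness": 0.60},
]
means = mean_score_by_version(scored_traces)
regressed = is_regression(means["v1"], means["v2"])
print(regressed)  # True
```

The same check can run in CI against a fixed dataset, which is one way to gate a deployment on eval scores.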

Prompt management as a separate concern

Deploying new prompts as code deploys is slow and couples prompt iteration to your release cycle. Both Langfuse and LangSmith offer centralized prompt registries where prompts are versioned, tagged (production/staging/dev), and fetched at runtime by your app. This means a product manager or prompt engineer can update a prompt and see the effect on live traces without touching the codebase or waiting for a deploy.
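The fetch-at-runtime pattern can be illustrated with an in-memory stand-in for a hosted registry. `PromptRegistry` and its methods are hypothetical and do not mirror either vendor's SDK; the sketch only shows why moving a label changes behavior without a deploy.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Toy in-memory stand-in for a hosted prompt registry."""
    versions: dict[str, list[str]] = field(default_factory=dict)      # name -> templates
    labels: dict[tuple[str, str], int] = field(default_factory=dict)  # (name, label) -> version

    def publish(self, name: str, template: str) -> int:
        self.versions.setdefault(name, []).append(template)
        return len(self.versions[name])  # versions are numbered from 1

    def tag(self, name: str, label: str, version: int) -> None:
        self.labels[(name, label)] = version

    def get(self, name: str, label: str = "production") -> str:
        version = self.labels[(name, label)]
        return self.versions[name][version - 1]


registry = PromptRegistry()
v1 = registry.publish("support-answer", "Answer briefly: {question}")
v2 = registry.publish("support-answer", "Answer briefly and cite sources: {question}")
registry.tag("support-answer", "production", v1)
registry.tag("support-answer", "staging", v2)

# The app resolves the label at runtime, so retagging changes behavior without a deploy.
before = registry.get("support-answer").format(question="What is tracing?")
registry.tag("support-answer", "production", v2)
after = registry.get("support-answer").format(question="What is tracing?")
```

Application code asks only for a name and a label, so promoting a version from staging to production is a registry operation rather than a code change.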

PII and data residency

LLM spans contain prompts, and prompts often contain user messages — which may include PII. Before sending traces to any cloud service, decide whether to: (a) mask PII fields before the span is created, (b) run a redaction filter in an OTel Collector pipeline between your app and the backend, or (c) self-host the entire observability stack so data never leaves your infrastructure. Langfuse's MIT self-hosted deployment is the most complete answer to (c). LangSmith's self-hosted Enterprise option requires an Enterprise license agreement.
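Option (a), masking before the span is created, can be as simple as a pre-processing function applied to messages before they are logged. The patterns below are deliberately simple and will miss many PII formats; `mask_pii` and `mask_messages` are hypothetical helpers, and a production setup would likely use a dedicated detection library.

```python
import re

# Deliberately simple patterns; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def mask_messages(messages: list[dict]) -> list[dict]:
    """Return a copy of chat messages that is safer to attach to a span."""
    return [{**m, "content": mask_pii(m["content"])} for m in messages]


masked = mask_messages([
    {"role": "user", "content": "Email me at jane.doe@example.com or call +1 415 555 0100."}
])
print(masked[0]["content"])  # Email me at [EMAIL] or call [PHONE].
```

Masking a copy keeps the original messages intact for the model call, so only the logged version is redacted.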

FAQ

Is Langfuse really free to self-host?

Yes. The core platform — tracing, prompt management, evaluations, annotation queues, datasets, and the playground — is MIT-licensed with no feature restrictions when self-hosted. The only commercially-licensed self-hosted features are enterprise add-ons like SCIM provisioning, audit logs, project-level RBAC, and UI white-labeling. The infrastructure itself (PostgreSQL, ClickHouse, Redis, S3) carries its own cloud costs, but the software license is free.

Can I use LangSmith without LangChain?

Yes, using the @traceable decorator or the RunTree API directly. However, the zero-config automatic tracing that makes LangSmith compelling only works for LangChain and LangGraph. For non-LangChain apps, Langfuse's framework-agnostic approach and OpenTelemetry support typically require less instrumentation effort.

Is Helicone still safe to use in production?

For existing production deployments, yes. Helicone is stable, the Apache 2.0 license allows self-hosting indefinitely, and the team has committed to shipping security patches and new model support. The key risk is for new projects: with no new features planned post-Mintlify acquisition, you should expect the product to gradually fall behind Langfuse and LangSmith on evaluation and agent-tracing capabilities.

How does Langfuse pricing compare to LangSmith at scale?

At low trace volumes (under 50k/month), Langfuse's free Hobby tier is more generous than LangSmith's 5k-trace Developer tier. At mid-scale (200k traces/month, 5-person team), LangSmith at $39/seat/mo plus trace overages typically runs $500–800/month; Langfuse Pro at $199/month plus $8/100k units for overages is often cheaper. At very high volume (10M+ traces/month), Langfuse's volume discounts bring per-unit costs down significantly, and self-hosting the MIT version eliminates licensing costs entirely.

What is the difference between Helicone's proxy model and Langfuse's SDK model?

Helicone intercepts HTTP traffic at the network level — no code change beyond a base URL swap. It records the raw request and response but has no visibility into your application's internal pipeline structure. Langfuse's SDK model instruments your code at the function level, so it captures the parent-child relationship between pipeline steps (retrieval, reranking, LLM call) as a structured trace tree. The proxy approach is faster to set up; the SDK approach gives richer debugging information for complex pipelines.

Can I switch observability tools without rewriting all my instrumentation?

Yes, if you use OpenTelemetry. Both Langfuse and LangSmith accept OTLP traces, meaning you can instrument once with the standard OTel SDK (using the GenAI semantic conventions) and switch backends by changing the exporter's endpoint and auth settings. Helicone's proxy model is inherently portable for simple request logging: swap back the base URL and you are off Helicone. The harder migration is moving away from LangSmith's @traceable decorator or Langfuse's @observe() decorator to OTel, but that is a one-time refactor.
