AI/TLDR

What Is LangSmith? Tracing and Evaluating LangChain Apps

Learn to read a LangSmith trace, find the step where your chain went wrong, and set up your first evaluation dataset.

INTERMEDIATE · 11 MIN READ · UPDATED 2026-06-12

In plain English

LangSmith is an observability and evaluation platform built by the LangChain team. When your LLM app runs — whether it is a simple chatbot, a multi-step chain, or a LangGraph agent — LangSmith records every step: the exact prompt sent to the model, the model's response, every tool call and its output, how many tokens were used, and how long each step took. All of that lands in a web dashboard where you can inspect it, replay it, and measure whether it is working correctly.

A useful analogy: think of LangSmith as a flight data recorder for your AI app. You wouldn't fly a plane in production without a black box. Yet most LLM apps ship without any record of what the model actually received, what it returned, or which step caused the bad output the user complained about. LangSmith is that black box — except it is also a dashboard, an eval harness, and a prompt testing environment all in one.

LangSmith is a hosted SaaS product at smith.langchain.com. It works with any Python or TypeScript code — not just LangChain. You can trace a raw OpenAI call, a LangGraph agent, or a custom function just by setting two environment variables or adding a single decorator. For LangChain and LangGraph apps, tracing is fully automatic once the API key is set.

Why it matters

Debugging an LLM app without a tracer is like debugging a web server with no logs. You know the output is wrong, but you don't know which step produced it, what the model actually saw, or whether the problem is in the prompt, the retriever, or the tool parsing. LangSmith closes that gap by recording every step so you can reproduce the bug reliably.

Beyond debugging, LangSmith addresses a second problem: knowing whether your app is getting better or worse over time. When you tweak a prompt or swap a model, it is easy to eyeball a few outputs and convince yourself things improved. But five examples is not a test. LangSmith's evaluation layer lets you run every change against a curated dataset of real examples and get a score you can trust.

  • Debugging regressions: when a refactored chain starts hallucinating, you compare traces from before and after to find the divergence point.
  • Prompt iteration: tweak the system prompt, run the dataset, see whether quality went up or down — in minutes, not days.
  • Model swaps: replace gpt-4o with a cheaper model and immediately measure the quality drop against your real test cases.
  • Production monitoring: stream live traffic to LangSmith, run automated evaluators on every trace, and get alerted the moment quality degrades.
  • Team collaboration: share a trace URL with a colleague so they can see exactly what the model received and returned, without needing to reproduce the bug locally.

How it works

LangSmith's data model has three main concepts: runs, traces, and datasets. Understanding these three things gives you the mental model to use every feature in the product.

Runs and traces

A run is the record of a single function call — one LLM call, one tool invocation, one retrieval step. Every run captures inputs, outputs, start time, end time, token counts, and any error that occurred. A trace is the full tree of runs produced by one top-level user request. The root run represents the entry point (for example, the LangGraph ainvoke call), and every nested LLM call, tool call, and retrieval appears as a child run beneath it. This parent-child tree is what you see in the LangSmith trace viewer.
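
To make the parent-child structure concrete, here is a minimal sketch of a trace as a tree of runs. The `Run` class, its field names, and the `walk` helper are hypothetical stand-ins for illustration; they are not the SDK's own types, and the real run schema carries more fields.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Run:
    """Hypothetical stand-in for one recorded step; not the SDK's own class."""
    name: str
    run_type: str                 # e.g. "chain", "llm", "tool", "retriever"
    latency_s: float
    total_tokens: int = 0
    error: Optional[str] = None
    children: List["Run"] = field(default_factory=list)


def walk(run: Run, depth: int = 0) -> Iterator[Tuple[int, Run]]:
    """Yield (depth, run) pairs, parents before children."""
    yield depth, run
    for child in run.children:
        yield from walk(child, depth + 1)


# One trace: the root run plus every nested child run beneath it.
trace = Run("agent", "chain", 2.4, children=[
    Run("retrieve-docs", "retriever", 0.3),
    Run("answer", "chain", 2.0, children=[
        Run("call-model", "llm", 1.9, total_tokens=812),
    ]),
])

for depth, run in walk(trace):
    print("  " * depth + f"{run.name} [{run.run_type}] {run.latency_s:.1f}s")
```

The indented printout mirrors the waterfall in the trace viewer: the root run sits at depth 0 and each nested call is indented beneath its parent.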

Enabling tracing: two lines of setup

For LangChain and LangGraph apps, you enable tracing with two environment variables. LangSmith automatically intercepts every LLM call, chain step, and tool invocation — no code changes required.

bash
export LANGSMITH_API_KEY="ls__your_key_here"
export LANGSMITH_TRACING=true
# Optional: group traces into a named project
export LANGSMITH_PROJECT="my-agent"
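
If you cannot export shell variables (in a notebook, for example), the same settings can be applied from Python. This is a small sketch that assumes the variables are read when tracing first starts, so it is safest to set them before importing or invoking any LangChain or LangGraph code. The project name is the same placeholder used above.

```python
import os

# Set these before importing or invoking any LangChain / LangGraph code.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "my-agent"  # optional: groups traces into a project

# Assumption: the API key comes from your shell or secret manager, never from source code.
if not os.environ.get("LANGSMITH_API_KEY"):
    print("LANGSMITH_API_KEY is not set; trace uploads will fail until it is.")
```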

For code that does not use LangChain — for example, a plain OpenAI call inside your own function — you add the @traceable decorator from the langsmith Python package. LangSmith wraps the function, creates a run every time it is called, and attaches it to any parent trace in the current context automatically.

python
import openai
from langsmith import traceable
from langsmith.wrappers import wrap_openai

client = wrap_openai(openai.OpenAI())

@traceable(name="answer-question")
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

result = answer_question("What is the capital of France?")
# A trace appears in smith.langchain.com automatically

Datasets and evaluations

A dataset in LangSmith is a collection of examples, where each example has an input and (optionally) a reference output. You can build a dataset by exporting real production traces, writing examples by hand, or generating them synthetically. Once you have a dataset, you run an evaluation against it: LangSmith calls your app on each example, collects the outputs, and scores them using one or more evaluators.

Evaluators can be heuristic (exact string match, regex, JSON schema check), LLM-as-judge (another model grades the output against a rubric you define), or a custom Python function with any business logic. Results appear in a comparison table so you can see exactly which examples regressed when you changed the prompt.
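
Heuristic evaluators are usually just small pure functions. The sketch below shows one of each kind mentioned above: exact match, regex, and a structural JSON check. The function names, the parameter names, and the `{"key": ..., "score": ...}` return shape are assumptions for illustration; check the SDK documentation for the exact signature your version of `evaluate` expects.

```python
import json
import re
from typing import Set


def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 when the answer equals the reference, ignoring case and outer whitespace."""
    hit = outputs["answer"].strip().lower() == reference_outputs["answer"].strip().lower()
    return {"key": "exact_match", "score": int(hit)}


def mentions_year(outputs: dict) -> dict:
    """Regex check: does the answer contain a standalone four-digit number?"""
    found = re.search(r"\b\d{4}\b", outputs["answer"]) is not None
    return {"key": "mentions_year", "score": int(found)}


def valid_json_with_keys(text: str, required: Set[str]) -> dict:
    """Structural check: the text parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return {"key": "valid_json", "score": 0}
    ok = isinstance(parsed, dict) and required.issubset(parsed)
    return {"key": "valid_json", "score": int(ok)}


print(exact_match({"answer": " 1991 "}, {"answer": "1991"}))
print(mentions_year({"answer": "It was first released in 1991."}))
print(valid_json_with_keys('{"answer": "1991"}', {"answer", "source"}))
```

Checks like these are cheap and deterministic, which makes them a sensible first layer before you spend tokens on an LLM judge.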

Reading a trace: what to look for

Opening a trace in LangSmith shows a nested waterfall of runs. Learning to read it quickly is a skill that pays off every time you have a bad output to investigate.

  1. Find the root run and check its total latency. If the request took 30 seconds, you immediately know something stalled.
  2. Expand LLM runs and read the exact prompt that was sent. A wrong output very often traces back to a wrong prompt: a missing instruction, a stale or unintended system message, or a retrieval result that contaminated the context.
  3. Check token counts on each LLM run. If your app assembled 8 000 tokens of context for a model with a 4 000-token limit, you have a context overflow problem: depending on the stack, the request fails outright or the context is silently truncated, which may not surface as a clean error.
  4. Inspect tool runs for unexpected inputs or error outputs. A tool call with a malformed argument is often the root cause of a confusing final answer.
  5. Look for error runs — they appear in red. Click into one to see the full exception and the state at the time it occurred.
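
The checklist above can also be expressed as a small triage function over an exported run tree. This is a sketch under stated assumptions: the dictionary layout (`name`, `run_type`, `latency_s`, `total_tokens`, `error`, `children`) is hypothetical, and the thresholds are arbitrary defaults you would tune for your own app.

```python
from typing import List


def triage(run: dict, slow_s: float = 10.0, token_budget: int = 4000, path: str = "") -> List[str]:
    """Walk a run tree depth first and collect human-readable findings."""
    here = f"{path}/{run['name']}"
    findings = []
    if run.get("error"):
        findings.append(f"{here}: error: {run['error']}")
    if run.get("latency_s", 0.0) > slow_s:
        findings.append(f"{here}: slow ({run['latency_s']:.1f}s)")
    if run.get("run_type") == "llm" and run.get("total_tokens", 0) > token_budget:
        findings.append(f"{here}: {run['total_tokens']} tokens exceeds budget of {token_budget}")
    for child in run.get("children", []):
        findings.extend(triage(child, slow_s, token_budget, here))
    return findings


example_trace = {
    "name": "agent", "run_type": "chain", "latency_s": 31.2,
    "children": [
        {"name": "retrieve", "run_type": "retriever", "latency_s": 0.4},
        {"name": "draft-answer", "run_type": "llm", "latency_s": 6.1, "total_tokens": 7900},
        {"name": "lookup-order", "run_type": "tool", "latency_s": 24.0,
         "error": "KeyError: 'order_id'"},
    ],
}

for finding in triage(example_trace):
    print(finding)
```

Here the slow root run is explained by its children: one tool call both failed and consumed most of the latency, and one LLM call went well over the token budget.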

Running your first evaluation dataset

The fastest way to set up offline evaluation is with the langsmith Python SDK. The pattern is: create a dataset, define your app as a target function, pick an evaluator, and call evaluate.

python
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

client = Client()

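# 1. Create a dataset. create_dataset raises if the name is already taken;
#    to reuse an existing dataset, fetch it with client.read_dataset(dataset_name=...)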
dataset = client.create_dataset("qa-smoke-test")
client.create_examples(
    inputs=[{"question": "What year was Python created?"}],
    outputs=[{"answer": "1991"}],
    dataset_id=dataset.id,
)

# 2. Define the function under test
def my_app(inputs: dict) -> dict:
    # answer_question is the @traceable function from the earlier example;
    # replace it with your real chain or agent call
    return {"answer": answer_question(inputs["question"])}

# 3. Run evaluation with an LLM-as-judge evaluator
#    (the "qa" evaluator needs the langchain package and credentials for the judge model)
results = evaluate(
    my_app,
    data="qa-smoke-test",
    evaluators=[LangChainStringEvaluator("qa")],
    experiment_prefix="baseline",
)
print(results.to_pandas())
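
Once you have two experiments, say a baseline and a candidate prompt, the useful question is which individual examples got worse, not only whether the average moved. The sketch below assumes you have already reduced each experiment to a hypothetical `{example_id: score}` mapping; how you extract that from the results object is left out.

```python
from typing import Dict, List, Tuple


def find_regressions(
    baseline: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.0
) -> List[Tuple[str, float, float]]:
    """Return (example_id, old_score, new_score) where the score dropped by more than tolerance."""
    regressions = []
    for example_id in sorted(baseline.keys() & candidate.keys()):
        old, new = baseline[example_id], candidate[example_id]
        if old - new > tolerance:
            regressions.append((example_id, old, new))
    return regressions


baseline_scores = {"ex-1": 1.0, "ex-2": 1.0, "ex-3": 0.0}
candidate_scores = {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 1.0}

# The totals are identical, yet one example regressed.
print(sum(baseline_scores.values()), sum(candidate_scores.values()))
print(find_regressions(baseline_scores, candidate_scores))
```

This is why a per-example comparison matters: an unchanged aggregate score can hide a fix on one example and a break on another.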

LangSmith vs Langfuse: choosing between them

LangSmith is not the only LLM observability platform. Langfuse is its most-discussed open-source alternative, and the choice between them comes up frequently enough to deserve a direct comparison. Both trace and evaluate LLM apps; the differences are in ecosystem fit, data control, and pricing model.

| | LangSmith | Langfuse |
|---|---|---|
| Source | Closed-source SaaS | Open-source (MIT); cloud or self-host |
| Best fit | All-in LangChain / LangGraph shops | Framework-agnostic; any LLM stack |
| LangChain integration | Zero-config automatic tracing | Supported via callback handler |
| Self-hosting | Enterprise license required | Free forever, first-class support |
| Free tier | 5 000 traces/month, 14-day retention | 50 000 observations/month |
| Paid plan | $39/seat/month (Plus) | ~$59/month (Pro), no per-seat pricing |
| Data residency | US servers on cloud plan | Your region when self-hosted |
| Prompt Hub | Yes (LangChain Hub integration) | Separate prompt management UI |

The short heuristic: if you are already using LangChain or LangGraph, LangSmith's automatic zero-config tracing and tight Hub integration make it the obvious starting point. If you are building on a different stack — raw OpenAI calls, LlamaIndex, Vercel AI SDK, a custom loop — or if EU data residency matters to your deployment, Langfuse's open-source self-hosted path is worth the extra setup.

Going deeper

Once you have basic tracing and offline evals working, the next layer is online evaluation — scoring real production traffic automatically rather than only curated test cases. LangSmith lets you attach evaluators to a project so they run on every incoming trace. This turns evaluation from a one-off CI check into a continuous smoke detector: you find out within minutes if a prompt change degraded quality on real queries, not just your hand-picked examples.
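
The "continuous smoke detector" idea boils down to a rolling pass rate over recently scored traces. The following is a minimal sketch of that logic, not LangSmith's own alerting: `QualityMonitor`, its parameters, and the 0/1 score convention are all hypothetical.

```python
from collections import deque


class QualityMonitor:
    """Rolling pass rate over the most recent scored traces (hypothetical helper)."""

    def __init__(self, window: int = 50, min_pass_rate: float = 0.9, min_samples: int = 10):
        self.scores = deque(maxlen=window)   # old scores fall out automatically
        self.min_pass_rate = min_pass_rate
        self.min_samples = min_samples

    def record(self, score: int) -> bool:
        """Add a 0/1 score; return True when the window has degraded enough to alert."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False                     # too little data to judge
        return sum(self.scores) / len(self.scores) < self.min_pass_rate


monitor = QualityMonitor(window=10, min_pass_rate=0.8, min_samples=5)
alerts = [monitor.record(score) for score in [1, 1, 1, 1, 1, 0, 0, 0]]
print(alerts)
```

The `min_samples` guard keeps a single early failure from paging anyone; the alert fires only once enough recent traces have been scored and the pass rate has actually dropped.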

LLM-as-judge calibration

LLM-as-judge evaluators are powerful but can drift from human judgment. LangSmith ships an Align Evals workflow: when you disagree with a judge's verdict in the UI, you record a correction. LangSmith stores those corrections as few-shot examples and automatically injects them into future evaluation prompts, making the judge progressively more accurate without manual prompt rewrites. Tracking agreement rate over time tells you whether the judge is converging or drifting.
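
Two pieces of this workflow are easy to reason about in code: measuring how often the judge agrees with humans, and feeding human corrections back in as few-shot examples. The sketch below illustrates the general technique only. It is not how Align Evals is implemented, and the function names and prompt layout are assumptions.

```python
from typing import List, Tuple


def agreement_rate(judge_labels: List[str], human_labels: List[str]) -> float:
    """Fraction of items where the judge's verdict matches the human's."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def build_judge_prompt(rubric: str, corrections: List[Tuple[str, str]], answer: str) -> str:
    """Place human-corrected (answer, verdict) pairs ahead of the answer to grade."""
    shots = "\n".join(f"Answer: {a}\nVerdict: {v}" for a, v in corrections)
    return f"{rubric}\n\n{shots}\n\nAnswer: {answer}\nVerdict:"


judge = ["pass", "pass", "fail", "pass"]
human = ["pass", "fail", "fail", "pass"]
print(agreement_rate(judge, human))  # 3 of the 4 verdicts match

corrections = [("The capital of France is Lyon.", "fail")]
print(build_judge_prompt(
    "Grade factual accuracy as pass or fail.",
    corrections,
    "The capital of France is Paris.",
))
```

Recomputing the agreement rate on a fixed, human-labelled sample after each batch of corrections is one simple way to see whether the judge is converging or drifting.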

Annotation queues for human review

Annotation queues let you route a slice of production traces to a human reviewer — either a subject-matter expert or a QA team member. Reviewers score each trace against a rubric you define (correctness, tone, safety, groundedness). Those scores flow back into LangSmith as feedback on the original run, where they can seed a dataset or calibrate an LLM judge. This is the standard pattern for bootstrapping evals before you have enough volume for purely automated scoring.
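
Routing "a slice" of traffic usually means sampling. One common approach, sketched here as an assumption rather than as LangSmith's own mechanism, is to hash the trace ID so that the same trace always gets the same decision, regardless of which worker handles it. The `should_review` helper and the `trace-N` IDs are hypothetical.

```python
import hashlib


def should_review(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of traces for human review."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < sample_rate


trace_ids = [f"trace-{i}" for i in range(1000)]
queue = [t for t in trace_ids if should_review(t, 0.05)]
print(f"{len(queue)} of {len(trace_ids)} traces routed to reviewers")
```

Because the decision depends only on the ID, re-running the selection never reshuffles which traces your reviewers are asked to score.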

Prompt Hub

LangSmith includes a Prompt Hub — a versioned registry for prompt templates. You commit a prompt version, assign it a tag (production, staging, experiment-v2), and then pull it from the registry at runtime using hub.pull(). This decouples prompt iteration from code deploys: a product manager can push a new system prompt to staging, see its eval scores, and promote it to production without touching the app's codebase.
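
The commit, tag, and pull cycle is easier to see in a toy model. The class below is an in-memory sketch of a versioned registry with movable tags. `PromptRegistry` and its methods are hypothetical and only illustrate the decoupling; the real Hub is a hosted service with its own client calls.

```python
from typing import Dict, List, Tuple


class PromptRegistry:
    """In-memory sketch of a versioned prompt store with movable tags."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}
        self._tags: Dict[Tuple[str, str], int] = {}

    def commit(self, name: str, template: str) -> int:
        """Append a new immutable version and return its index."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        return len(versions) - 1

    def tag(self, name: str, tag: str, version: int) -> None:
        """Point a tag such as "production" at an existing version."""
        if not 0 <= version < len(self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._tags[(name, tag)] = version

    def pull(self, name: str, tag: str) -> str:
        """Resolve a tag to the template it currently points at."""
        return self._versions[name][self._tags[(name, tag)]]


registry = PromptRegistry()
v0 = registry.commit("support-bot", "You are a helpful support agent.")
registry.tag("support-bot", "production", v0)

v1 = registry.commit("support-bot", "You are a concise, friendly support agent.")
registry.tag("support-bot", "staging", v1)

before = registry.pull("support-bot", "production")  # still the v0 text
registry.tag("support-bot", "production", v1)         # promote once evals pass
after = registry.pull("support-bot", "production")   # now the v1 text
```

The application only ever asks for a tag, so promoting a prompt is a tag move rather than a code deploy.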

Tracing LangGraph agents

LangGraph agents produce rich nested traces in LangSmith automatically. Each graph node appears as a child run, so you can see exactly which node called which tool, what state it received, and what state it returned. For long-running agentic tasks, LangSmith can group the traces that share a thread ID into a single thread view, even if they span multiple HTTP requests. That matters when your agent pauses for human input and resumes hours later.

The broader LLMOps picture

LangSmith covers one critical slice of LLMOps: observability and evaluation. A mature production setup pairs it with structured experiment tracking (recording which model version, prompt hash, and retriever config produced which eval score), a CI gate that fails the deployment if eval scores drop below a threshold, and cost monitoring to catch token regressions before they appear on the billing statement. LangSmith's experiment comparison view handles much of this, but integrating its API into your CI pipeline is the step that makes evaluations enforceable rather than advisory.
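
A CI gate of the kind described above can be very small. The sketch below assumes your pipeline has already produced aggregate scores per metric. The metric names, thresholds, and the `gate` function are hypothetical, and fetching the scores from LangSmith is left out.

```python
from typing import Dict, List


def gate(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one failure message per metric that is missing or below its threshold."""
    failures = []
    for metric, minimum in sorted(thresholds.items()):
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: no score reported")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} is below the required {minimum:.2f}")
    return failures


failures = gate(
    scores={"correctness": 0.91, "groundedness": 0.78},
    thresholds={"correctness": 0.90, "groundedness": 0.85},
)
for message in failures:
    print("EVAL GATE FAILED:", message)

# A CI script would pass this value to sys.exit() so the job fails on regressions.
exit_code = 1 if failures else 0
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when an evaluator did not run is advisory at best.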

FAQ

Do I need to use LangChain to use LangSmith?

No. LangSmith works with any Python or TypeScript code. For LangChain and LangGraph apps, tracing is automatic once you set LANGSMITH_TRACING=true and LANGSMITH_API_KEY. For other code, wrap functions with the @traceable decorator or use wrap_openai() to auto-trace OpenAI calls.

What is a LangSmith trace and how is it different from a run?

A run is the record of a single function call — one LLM call, one tool invocation, one retrieval step. A trace is the full tree of runs produced by one top-level request. The trace is the root run plus all its nested child runs, visualised as a waterfall in the LangSmith UI.

How do I create an evaluation dataset in LangSmith?

You can create a dataset three ways: export real production traces by clicking 'Add to dataset' on any trace, create examples manually in the UI, or use the Python SDK to programmatically upload input/output pairs. Once a dataset exists, you run it with the evaluate() function from the langsmith package, passing your app function and one or more evaluators.

What evaluator types does LangSmith support?

LangSmith supports heuristic evaluators (string match, regex, JSON schema), LLM-as-judge evaluators (a model grades outputs against a rubric), human annotation via queues, pairwise comparisons between two model outputs, and custom Python or TypeScript functions that return a numeric or categorical score.

Is LangSmith free to use?

Yes, for development. The free Developer tier gives you 5 000 traces per month with 14-day data retention and a single seat. The Plus plan costs $39/seat/month and includes 10 000 traces per month, with overage billed at $0.50 per 1 000 traces. Enterprise pricing is custom and includes self-hosting options.

When should I choose Langfuse over LangSmith?

Reach for Langfuse when you are not using LangChain or LangGraph, when EU data residency is a compliance requirement, or when you want a fully self-hosted, open-source (MIT) solution with no per-seat pricing. If your whole stack is LangChain/LangGraph and you want the tightest integration with the least setup, LangSmith is the easier starting point.

Further reading