AI/TLDR

What Is DSPy? Programming LLMs Instead of Prompting Them

Grasp DSPy's big idea — declare what you want with signatures and let a compiler find the prompt — and why it changes how you build with LLMs.

BEGINNER · 9 MIN READ · UPDATED 2026-06-11

In plain English

Most people build with LLMs by writing prompts — long strings of carefully worded instructions like "You are an expert assistant. Think step by step. Always respond in JSON...". You tweak a word here, add an example there, and pray the model behaves. This is prompt engineering, and most of the time it's done by hand, by feel.

DSPy (from Stanford NLP) takes a different bet. Its tagline is literally "programming — not prompting — language models." Instead of writing the exact prompt string, you declare what you want: the inputs, the outputs, and a way to score whether the answer was good. Then a compiler searches for the prompt — wording, examples, formatting — that makes your model hit that score. You write the spec; the machine writes the prompt.

Here's the everyday analogy. Hand-prompting is like writing assembly code: you control every instruction, and a tiny change can break everything. DSPy is like writing in a high-level language with a compiler: you say question -> answer, and the compiler figures out the messy machine details for you. You stop hand-tuning prompts the way you stopped hand-allocating registers.

Why it matters

Hand-written prompts are fragile in ways that get worse as your project grows. DSPy exists to fix three specific pains.

  • Prompts break when the model changes. A prompt tuned for one model often degrades on the next one. With DSPy you re-run the compiler against the new model and it re-finds a good prompt — no rewriting strings by hand.
  • Tuning by feel doesn't scale. Eyeballing three examples isn't a test. DSPy makes you supply a metric and a handful of examples, then optimizes against real numbers instead of vibes. Improvement becomes measurable, not anecdotal.
  • Prompt spaghetti. Multi-step pipelines (retrieve, then reason, then format) turn into giant brittle prompt strings glued together. DSPy lets you compose small typed modules like normal code, so each piece is testable and reusable.

Who should care? People who've outgrown copy-pasting prompts: anyone building a RAG pipeline, a classifier, an extraction tool, or a multi-step agent and tired of babysitting prompt strings. If you're shipping one quick chatbot, plain prompting is fine. If you're maintaining an LLM system that has to stay accurate across model upgrades, DSPy is built for you.

What did it replace? Mostly the manual, artisanal half of prompt engineering — the endless A/B fiddling with wording and few-shot examples. DSPy reframes that fiddling as an optimization problem the computer can solve, much like how compilers replaced hand-written assembly. It doesn't replace knowing what a good prompt looks like; it automates the search for one.

How it works

DSPy has three core building blocks. Learn these three words and you understand the framework: signatures, modules, and optimizers.

Signatures — declare the task

A signature is a short, typed description of what goes in and what comes out. In its simplest form it's a string: "question -> answer". For real work you write a class with typed fields and a docstring describing the task. Crucially, a signature says nothing about the actual prompt wording — that's the compiler's job. You're declaring intent, not instructions.
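
To make the "intent, not instructions" point concrete, here is a toy sketch in plain Python of what a string signature carries once parsed: field names and nothing else. This is not DSPy's implementation; `ToySignature` and `parse_signature` are hypothetical names invented for illustration.

```python
from typing import NamedTuple


class ToySignature(NamedTuple):
    """Hypothetical stand-in for a signature: field names only, no prompt text."""
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def parse_signature(spec: str) -> ToySignature:
    """Split an 'a, b -> c' string into input and output field names."""
    left, arrow, right = spec.partition("->")
    if not arrow:
        raise ValueError(f"expected 'inputs -> outputs', got {spec!r}")
    inputs = tuple(name.strip() for name in left.split(",") if name.strip())
    outputs = tuple(name.strip() for name in right.split(",") if name.strip())
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return ToySignature(inputs, outputs)


sig = parse_signature("question -> answer")
print(sig.inputs, sig.outputs)
```

Nothing in the parsed result says how to word the request to the model, which is exactly the gap the compiler is meant to fill.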

Modules — pick a strategy

A module wraps a signature with a reasoning strategy. dspy.Predict just asks the model directly. dspy.ChainOfThought makes it reason step by step before answering. dspy.ReAct lets it call tools in a loop, the way an AI agent does. Swap the module and the same signature behaves differently, with zero prompt rewriting. Modules are just Python objects, so you compose them into bigger programs like Lego bricks.
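
The "same signature, different strategy" idea can be sketched without DSPy at all. In the toy below, `ToyPredict` and `ToyChainOfThought` are hypothetical stand-ins (not the real dspy.Predict or dspy.ChainOfThought), and `echo_lm` is a stub model so the example runs offline. The point is only that the strategy object, not the caller, decides what prompt text is built.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins, not DSPy classes: a signature is just field names,
# and a "language model" is any function from prompt text to completion text.
LM = Callable[[str], str]


@dataclass(frozen=True)
class ToySignature:
    inputs: tuple[str, ...]
    output: str


class ToyPredict:
    """Ask for the output field directly."""

    def __init__(self, signature: ToySignature, lm: LM):
        self.signature = signature
        self.lm = lm

    def build_prompt(self, **values: str) -> str:
        lines = [f"{name}: {values[name]}" for name in self.signature.inputs]
        lines.append(f"{self.signature.output}:")
        return "\n".join(lines)

    def __call__(self, **values: str) -> str:
        return self.lm(self.build_prompt(**values))


class ToyChainOfThought(ToyPredict):
    """Same signature, but request reasoning before the output field."""

    def build_prompt(self, **values: str) -> str:
        lines = [f"{name}: {values[name]}" for name in self.signature.inputs]
        lines.append("reasoning: think step by step, then give the final field.")
        lines.append(f"{self.signature.output}:")
        return "\n".join(lines)


def echo_lm(prompt: str) -> str:
    """Stub model so the sketch runs offline: reports the prompt's line count."""
    return f"{len(prompt.splitlines())} prompt lines"


qa = ToySignature(inputs=("question",), output="answer")
direct = ToyPredict(qa, echo_lm)
stepwise = ToyChainOfThought(qa, echo_lm)
print(direct(question="What is 2 + 2?"))
print(stepwise(question="What is 2 + 2?"))
```

Both objects are called the same way with the same signature; only the prompt they assemble differs.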

Optimizers — let the compiler tune it

An optimizer (formerly called a teleprompter in DSPy's docs) takes your program, a set of examples, and a metric that scores outputs, then automatically searches for a better prompt, including which few-shot examples to include and how to phrase instructions. Named optimizers include BootstrapFewShot, MIPROv2, and GEPA. This compile step is where "self-improving" comes from.

Put together, the loop is: you write a tiny declarative program, hand the optimizer some labeled examples and a scoring function, and it runs the program against those examples many times — keeping the prompt variations that score best. The output is the same program, now carrying an optimized prompt under the hood. You never edited a prompt string yourself.
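
The loop above can be sketched as a brute-force search in plain Python. This is a deliberately simplified illustration, not how BootstrapFewShot or any real DSPy optimizer works: `toy_program` is a stub "model" that labels a message by word overlap with its few-shot demos, and `toy_compile` (a hypothetical name) tries every demo subset and keeps the one that scores best on a small dev set.

```python
from itertools import combinations

Example = tuple[str, str]  # (message, gold category)


def toy_program(message: str, demos: list[Example]) -> str:
    """Stub 'model': return the label of the demo sharing the most words."""
    words = set(message.lower().split())
    best_label, best_overlap = "general", 0  # fall back to the catch-all label
    for demo_message, label in demos:
        overlap = len(words & set(demo_message.lower().split()))
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label


def accuracy(gold: str, predicted: str) -> float:
    """Metric: 1.0 on an exact label match, else 0.0."""
    return float(gold == predicted)


def toy_compile(pool: list[Example], devset: list[Example], num_demos: int = 2):
    """Try every demo subset of the given size; keep the best-scoring one."""
    best_demos: list[Example] = []
    best_score = -1.0
    for candidate in combinations(pool, num_demos):
        demos = list(candidate)
        total = sum(accuracy(gold, toy_program(msg, demos)) for msg, gold in devset)
        score = total / len(devset)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


pool = [
    ("I was double charged", "billing"),
    ("The app keeps crashing", "technical"),
    ("What are your hours?", "general"),
]
devset = [
    ("Why was my card charged twice?", "billing"),
    ("The app is crashing on login", "technical"),
    ("What hours are you open?", "general"),
]
demos, score = toy_compile(pool, devset)
print(score, [label for _, label in demos])
```

Real optimizers search a far larger space (instructions as well as demos) and do so more cleverly, but the shape is the same: propose a variation, score it with the metric, keep the winner.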

DSPy in code: a working example

Here's the whole idea in a short Python program. We configure a model, declare a task with a signature, wrap it in a chain-of-thought module, and call it. No prompt string anywhere.

basic_dspy.py (python)
import dspy

# 1) Point DSPy at any LLM (OpenAI, Anthropic, a local model, etc.).
lm = dspy.LM("anthropic/claude-sonnet-4-5", api_key="sk-ant-...")
dspy.configure(lm=lm)

# 2) Declare the task as a SIGNATURE: what goes in, what comes out.
class Classify(dspy.Signature):
    """Classify a customer message into a support category."""
    message: str = dspy.InputField()
    category: str = dspy.OutputField(desc="billing, technical, or general")

# 3) Wrap it in a MODULE. ChainOfThought makes the model reason first.
classify = dspy.ChainOfThought(Classify)

# 4) Call it like a normal function. DSPy builds the prompt for you.
result = classify(message="My card was charged twice this month!")
print(result.category)    # -> billing
print(result.reasoning)   # the step-by-step rationale, free with CoT

Now the part that makes DSPy DSPy — optimization. Give it labeled examples and a metric, and it tunes the prompt for you:

optimize_dspy.py (python)
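import dspy

# Same setup as basic_dspy.py, repeated so this file runs on its own.
dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-5", api_key="sk-ant-..."))

class Classify(dspy.Signature):
    """Classify a customer message into a support category."""
    message: str = dspy.InputField()
    category: str = dspy.OutputField(desc="billing, technical, or general")

classify = dspy.ChainOfThought(Classify)
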
# A few labeled examples (you'd typically use 20-200).
trainset = [
    dspy.Example(message="I was double charged", category="billing").with_inputs("message"),
    dspy.Example(message="The app keeps crashing", category="technical").with_inputs("message"),
    dspy.Example(message="What are your hours?", category="general").with_inputs("message"),
]

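# A metric: 1.0 if the predicted category matches the gold label, else 0.0.
# Case and surrounding whitespace are ignored so "Billing" still counts.
def accuracy(example, prediction, trace=None):
    return float(example.category.strip().lower() == prediction.category.strip().lower())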

# Compile: the optimizer searches for the best prompt + few-shot examples.
optimizer = dspy.BootstrapFewShot(metric=accuracy)
tuned = optimizer.compile(classify, trainset=trainset)

# `tuned` is the same program, now carrying an optimized prompt.
print(tuned(message="Why is my invoice wrong?").category)

DSPy vs prompt engineering vs other frameworks

DSPy isn't the only way to build LLM apps, and it overlaps with tools you may already know. Here's where it sits.

The key distinction: LangChain and LlamaIndex are about orchestration — wiring models, tools, vector stores, and data loaders into a flow — but you still write the prompts. DSPy is about the prompts themselves — generating and optimizing them automatically. They solve different layers, and people do combine them: orchestrate with one, optimize the prompts with DSPy.

DSPy also lives in the lightweight, typed corner of the ecosystem alongside the minimal provider SDKs. If you're still deciding whether you even want a framework, see what an agent framework is and whether you need one.

Common pitfalls

  • Optimizing with no real metric. DSPy's whole edge is the compiler, and the compiler is only as good as your metric. A vague or wrong scoring function tunes your prompt toward the wrong target. Spend your effort here, not on wording.
  • Too few or unrepresentative examples. The optimizer learns from your examples. A handful of cherry-picked easy cases produces a prompt that aces the demo and flops in production. Use enough varied, realistic examples.
  • Treating it as 'no prompting needed at all.' You still choose signatures, field descriptions, and modules — those are design decisions. DSPy automates the search for prompt wording, not the thinking about what your task is.
  • Forgetting compilation costs tokens. Running the optimizer makes many LLM calls against your training set, which costs time and money up front. It's an investment that pays off at inference time, but it isn't free — budget for it.
  • Skipping evaluation. Compiling against a train set and never checking a separate test set is how you overfit. Hold out examples and measure, like any ML workflow — see LLM evals.
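
The hold-out advice in the last pitfall can be sketched in a few lines of plain Python. Everything here is illustrative: `split_examples`, `evaluate`, and `memorizer` are hypothetical helpers, and `memorizer` stands in for a compiled program that has overfit by simply remembering its training set.

```python
import random
from typing import Callable

Example = tuple[str, str]  # (message, gold category)


def split_examples(examples: list[Example], test_fraction: float = 0.25, seed: int = 0):
    """Shuffle once with a fixed seed, then carve off a held-out test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * test_fraction))
    return shuffled[cut:], shuffled[:cut]  # (train, test)


def evaluate(program: Callable[[str], str], examples: list[Example]) -> float:
    """Average exact-match accuracy of `program` over `examples`."""
    hits = sum(program(message) == gold for message, gold in examples)
    return hits / len(examples)


def memorizer(train: list[Example]) -> Callable[[str], str]:
    """Stub 'compiled program' that overfits: it only knows its training set."""
    table = dict(train)
    return lambda message: table.get(message, "general")


examples = [
    ("I was double charged", "billing"),
    ("Refund my last payment", "billing"),
    ("The app keeps crashing", "technical"),
    ("Login page shows an error", "technical"),
    ("What are your hours?", "general"),
    ("Where are you located?", "general"),
    ("My invoice total is wrong", "billing"),
    ("Sync fails on my phone", "technical"),
]
train, test = split_examples(examples)
program = memorizer(train)
print("train accuracy:", evaluate(program, train))
print("held-out accuracy:", evaluate(program, test))
```

The training score is perfect by construction, so it says nothing about quality. Only the held-out number reflects how the program handles messages it was not tuned on.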

Going deeper

Once the signature → module → optimizer loop clicks, here's the more advanced territory worth knowing.

The optimizer zoo. BootstrapFewShot bootstraps few-shot examples by running your program and keeping the traces that score well. MIPROv2 jointly optimizes instructions and examples using a more sophisticated search. GEPA uses a reflective, evolutionary approach — it inspects failures in natural language and proposes improved instructions. Different optimizers suit different budgets and tasks; choosing one is itself a small experiment.

Optimizing weights, not just prompts. DSPy can go beyond prompts and use your examples to drive fine-tuning of the underlying model — letting you compile a program down into tuned weights for a smaller, cheaper model. This blurs the line between prompt optimization and model training, and it's a major reason the project calls itself a framework for building systems, not just a prompt tool.

Composing big programs. Because modules are plain Python objects, you nest them: a retrieval module feeds a chain-of-thought module feeds a formatting module, all inside one larger module with its own signature. The optimizer can then tune the whole pipeline end to end against a single metric — something nearly impossible to do by hand-tuning each prompt in isolation.

Production realities and open problems. Compiled prompts can be long (they embed selected examples), so watch your context window and cost. Optimization results vary run to run, so version and cache your compiled programs rather than recompiling on every deploy. And the honest hard part remains the metric: turning "is this answer good?" into a number is the same unsolved challenge at the heart of all LLM evaluation — DSPy makes optimization automatic, but it can only optimize toward whatever you can measure.
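
One way to act on the "version and cache" advice is to key each compiled artifact on everything that should force a recompile. The sketch below is a generic pattern in plain Python and does not use DSPy's own persistence API; `cache_key`, `load_or_compile`, and `fake_compile` are hypothetical names, and the stored state is a made-up dictionary standing in for whatever your optimizer produces.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(model: str, program_version: str, trainset: list[dict]) -> str:
    """Hash everything that should trigger a recompile when it changes."""
    payload = json.dumps(
        {"model": model, "program": program_version, "trainset": trainset},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_compile(cache_dir: Path, key: str, compile_fn) -> dict:
    """Return the cached compiled state, or compile once and store it."""
    path = cache_dir / f"compiled-{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    state = compile_fn()
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state


calls = []


def fake_compile() -> dict:
    """Stand-in for an expensive optimizer run; records each invocation."""
    calls.append(1)
    return {
        "instructions": "Classify the message.",
        "demos": [["I was double charged", "billing"]],
    }


trainset = [{"message": "I was double charged", "category": "billing"}]
key = cache_key("example-model", "v1", trainset)
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compile(Path(tmp), key, fake_compile)
    second = load_or_compile(Path(tmp), key, fake_compile)
print(len(calls), first == second)
```

Because the model name is part of the key, switching models produces a cache miss and a fresh compile, while redeploying unchanged code reuses the stored result.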

FAQ

What is DSPy used for?

DSPy is a Python framework for building LLM applications where you declare the task with typed signatures and let an optimizer compile the actual prompt. It's used for classifiers, extraction, RAG pipelines, and multi-step agents — anywhere you want measurable, maintainable LLM behavior instead of hand-tuned prompt strings.

What is the difference between DSPy and prompt engineering?

Prompt engineering means writing and tweaking prompt strings by hand. DSPy automates that: you declare inputs, outputs, and a scoring metric, and its compiler searches for the prompt wording and few-shot examples that maximize the metric. You still design the task, but you stop hand-writing prompt text.

What are signatures and modules in DSPy?

A signature is a short typed declaration of a task's inputs and outputs, like question -> answer. A module wraps a signature with a reasoning strategy — dspy.Predict for a direct answer, dspy.ChainOfThought for step-by-step reasoning, dspy.ReAct for tool-using loops. The same signature can run under different modules with no prompt rewriting.

Is DSPy better than LangChain?

They solve different problems, so it's not a direct contest. LangChain orchestrates models, tools, and data sources into a flow but leaves the prompts to you. DSPy focuses on generating and optimizing the prompts themselves. Many teams orchestrate with LangChain or LlamaIndex and optimize prompts with DSPy.

Who created DSPy?

DSPy comes from the Stanford NLP group and is open source on GitHub at stanfordnlp/dspy. The name stands for Declarative Self-improving Python, and the project's tagline is "programming — not prompting — language models."

Do I need labeled data to use DSPy?

To just run programs, no: signatures and modules work without any training data. To use the optimizers (the part that tunes prompts automatically), yes: you need a set of example inputs and a metric that scores outputs. Even a few dozen labeled examples are often enough to start seeing gains.

Further reading