In plain English
When you write a normal prompt, you hand the model a block of text and hope it produces something in the shape you want. The model is free to ramble, skip a field, wrap the answer in a friendly preamble, or invent a category you never offered. You wrote the request; the model controls the whole response. Guidance flips that. It is a programming style (and a Python library, guidance) that lets you write the prompt and the response together, interleaving fixed text you control with small slots the model is allowed to fill — and nothing else.

Think of it like a paper form versus an open letter. An open letter is a blank page: the writer can put anything anywhere. A form already has the labels printed — Name:, Date:, a checkbox for yes or no — and the person only fills the blanks. Guidance turns your prompt into that form. You print the structure; the model only writes inside the boxes you leave open, and you can even restrict a box to digits only, one of these three words, or valid JSON.
Because you wrote the labels yourself, they cost no generation — the model never has to produce the word Name:, you already typed it. The model only spends effort on the genuinely unknown parts. That single idea — you write the scaffolding, the model fills constrained gaps — is what people mean by guided or controlled generation.
Why it matters
The moment you wire an LLM into real software, free-form text becomes a liability. Your code expects a number, a true/false, a JSON object with three exact keys, or one label from a fixed list. A plain prompt requests that shape; it does not enforce it. One run in fifty comes back as "Sure! Here's the JSON you asked for:" followed by almost-valid JSON with a trailing comma, and your parser crashes in production.
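To make that failure concrete, here is a minimal sketch using only Python's standard `json` module. The three reply strings are invented for illustration; they are not real model output.

```python
import json

# Hypothetical model replies of the kind described above (invented for illustration).
replies = [
    'Sure! Here\'s the JSON you asked for:\n{"label": "negative"}',  # friendly preamble
    '{"label": "negative", "score": 91,}',                           # trailing comma
    '{"label": "negative", "score": 91}',                            # clean JSON
]

def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

results = [try_parse(reply) for reply in replies]
for reply, parsed in zip(replies, results):
    status = "ok  " if parsed is not None else "FAIL"
    print(status, repr(reply[:40]))
```

Only the third reply parses. The first two are what a human would call "basically right", which is exactly why they slip through testing and fail later.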
What controlled generation fixes
- Format reliability. If you constrain a slot to a regex or a grammar, an invalid answer cannot be generated. There is nothing to validate and retry after the fact, because the bad output never exists and so never reaches your parser.
- No stray prose. Because you wrote the surrounding text, the model can't add a greeting, a disclaimer, or a markdown code fence around the data. The slot is the only thing it writes.
- Closed choices stay closed. Ask for a sentiment label in a normal prompt and you might get positive, Positive, mostly positive, or upbeat. Constrain it to exactly positive | negative | neutral and only those three strings can come out.
- Cheaper and faster. The fixed text you supply is prefilled, not generated. The model skips the tokens you already wrote, so a heavily-templated interaction can use fewer generated tokens than the same task done as one big free-form prompt.
Who needs this? Anyone doing extraction (pull fields out of a document into a record), classification (map text to a fixed set of labels), routing (pick which tool or branch to take), or multi-step generation where each step feeds the next. It is closely related to the output format problem covered in getting structured output from a prompt — Guidance is one of the strongest ways to actually guarantee that format rather than hope for it.
How it works
Under the hood, an LLM generates one token at a time. At each step it produces a probability for every possible next token, then samples one. Constrained generation works by masking that list: before sampling, it sets the probability of every token that would break your rule to zero, so the model can only choose from the tokens that keep the output valid. Guidance is the layer that lets you describe those rules — as plain text, regex, grammars, loops, and conditionals — woven directly into the prompt.
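The masking step can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular library implements it: the vocabulary, the scores, and the function name `mask_and_renormalize` are all made up for the example.

```python
import math

def mask_and_renormalize(logits, allowed):
    """Softmax over `logits` where only tokens in `allowed` keep any probability."""
    if not any(tok in allowed for tok in logits):
        raise ValueError("the constraint leaves no legal token")
    # Disallowed tokens get a score of -inf, which becomes probability 0.
    masked = {tok: (s if tok in allowed else -math.inf) for tok, s in logits.items()}
    peak = max(masked.values())
    exps = {tok: math.exp(s - peak) for tok, s in masked.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical raw scores for the next token after "Sentiment: ".
logits = {"positive": 1.2, "negative": 2.0, "neutral": 0.3, "Sure": 2.5, "upbeat": 1.0}
allowed = {"positive", "negative", "neutral"}

probs = mask_and_renormalize(logits, allowed)
best = max(probs, key=probs.get)
print(best)
```

In this toy distribution the unconstrained favorite is "Sure", the start of a chatty preamble. After masking it has probability zero, and the remaining mass is shared among the three legal labels.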
The key mental model is an alternating loop. Your program is a sequence of two kinds of segment: fixed text you supply and constrained slots the model fills. Guidance walks through them in order, feeding your text to the model as context and letting the model generate only inside the slots, under whatever constraint you attached.
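As a rough sketch of that alternating loop, the toy runner below walks a list of segments: plain strings are fixed text, and `(name, options)` tuples are slots. It does not use the guidance library. `run_program` and `toy_choose` are hypothetical names, and `toy_choose` is a hard-coded stand-in for a model.

```python
def run_program(segments, choose):
    """Walk fixed-text and slot segments in order; only slots consult the model."""
    context, captured = "", {}
    for segment in segments:
        if isinstance(segment, str):
            context += segment                  # fixed text: appended as written
        else:
            name, options = segment
            value = choose(context, options)    # the model writes only here
            if value not in options:
                raise ValueError(f"slot {name!r} left its allowed set: {value!r}")
            captured[name] = value
            context += value
    return context, captured

def toy_choose(context, options):
    """Stand-in for a model: leans 'negative' when the context sounds unhappy."""
    preferred = "negative" if "Terrible" in context else options[0]
    return preferred if preferred in options else options[0]

program = [
    "Review: The battery dies in an hour. Terrible.\n",
    "Sentiment: ",
    ("label", ["positive", "negative", "neutral"]),
    "\nEscalate: ",
    ("escalate", ["yes", "no"]),
]

transcript, captured = run_program(program, toy_choose)
print(transcript)
print(captured)
```

The second slot sees the first slot's answer as part of its context, which is the property that makes multi-step programs possible.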
Crucially, the constraint is applied during decoding, not checked afterward. When a slot is limited to one of three labels, the very first generated token already steers the model down one of only three paths; subsequent tokens are forced to complete a valid label. There is no failed attempt to discard — the invalid tokens were never on the table.
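The "only three paths" behavior can be illustrated with a character-level toy. Real decoders work on tokenizer tokens rather than characters, and the scoring function here is invented. The sketch also assumes no option is a prefix of another.

```python
LABELS = ["positive", "negative", "neutral"]

def allowed_next(prefix, options):
    """Characters that keep `prefix` on a path toward at least one option."""
    return {opt[len(prefix)] for opt in options
            if opt.startswith(prefix) and len(opt) > len(prefix)}

def constrained_select(options, score):
    """Greedy decode one character at a time without ever leaving the option set."""
    out = ""
    while out not in options:
        candidates = sorted(allowed_next(out, options))
        out += max(candidates, key=lambda ch: score(out, ch))
    return out

# Hypothetical model preference: characters earlier in this string score higher.
PREFERENCE = "uneptgrlaiosv"

def toy_score(prefix, ch):
    return -PREFERENCE.index(ch) if ch in PREFERENCE else -len(PREFERENCE)

result = constrained_select(LABELS, toy_score)
print(result)
```

The toy model's favorite first character is "u", as if it wanted to write "upbeat", but that character is never offered. Its best legal choice at each step still ends on one of the three labels.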
Where the program lives
In the library, you build this template in normal Python. You add fixed strings, then call a generation primitive (often written gen(...)) for each slot, passing a regex, a stop string, or a select([...]) of allowed options. Loops and if statements are just Python — they decide how many slots you open and what text goes between them — which is why people say Guidance lets you put control flow inside the prompt.
```python
from guidance import models, gen, select

lm = models.Transformers("a-local-model")  # any backend the lib supports

# Fixed text (you write it) is interleaved with gen()/select() slots
# (the model fills them, under the constraint you attach).
lm += "Classify the review sentiment.\n"
lm += "Review: The battery dies in an hour. Terrible.\n"
lm += "Sentiment: " + select(["positive", "negative", "neutral"], name="label")
lm += "\nConfidence (0-100): " + gen("score", regex=r"[0-9]{1,3}")

print(lm["label"])  # guaranteed one of the three strings
print(lm["score"])  # guaranteed to be digits only
```

The select([...]) slot can only emit one of the listed strings; the gen(..., regex=...) slot can only emit text matching the pattern. Everything outside the slots (the labels Sentiment: and Confidence (0-100):) is text you already wrote, so the model spends zero generation on it. The results are captured by name (lm["label"], lm["score"]), ready for your code to use. Note that the pattern guarantees one to three digits, not a value of at most 100, so a range check still belongs in your code.
A worked example with control flow
The single feature that sets Guidance apart from a plain "give me JSON" prompt is that the program decides what happens next based on what the model just produced. You can branch and loop in Python, and each branch opens different constrained slots. Here is a tiny router that first asks the model to pick an intent, then — only for the branch it chose — extracts exactly the fields that branch needs.
```python
from guidance import models, gen, select

lm = models.Transformers("a-local-model")  # same placeholder backend as above

lm += "User message: 'Cancel my order #4471 please.'\n"
lm += "Intent: " + select(["order_status", "cancel", "other"], name="intent")
lm += "\n"

# Python control flow reads the model's own output and branches.
if lm["intent"] == "cancel":
    lm += "Order number: #" + gen("order", regex=r"[0-9]{4,6}")
elif lm["intent"] == "order_status":
    lm += "Looking up status...\n"
    lm += "Tracking id: " + gen("tracking", regex=r"[A-Z0-9]{8,12}")
else:
    lm += "Reply: " + gen("reply", stop="\n", max_tokens=60)
```

Notice what cannot go wrong here. The intent is always one of three known strings, so the if/elif/else is exhaustive. If the model picks cancel, the order number is forced to be 4 to 6 digits: never a sentence, never a missing field. Each branch produces a different shape, but every shape is guaranteed by construction. A single free-form prompt asking the model to "figure out the intent and return the right fields as JSON" gives you none of these guarantees.
Guidance vs other ways to control output
Constrained generation is not the only way to get structured output, and it is not always the right one. The table compares the common approaches by what they actually guarantee and what they cost.
| Approach | How it works | Guarantee | Trade-off |
|---|---|---|---|
| Plain prompt ("return JSON") | Ask nicely in the instructions | None — best effort | Simplest; can fail unpredictably |
| Validate + retry | Parse the output, re-ask if it's malformed | Eventually valid | Extra calls, latency, can loop |
| Provider JSON / schema mode | The API enforces a JSON schema for you | Valid JSON of that schema | Hosted-only; tied to one provider |
| Guidance / constrained decoding | Mask invalid tokens during generation | Cannot produce invalid output | Needs token-level access to the model |
The dividing line is where you sit relative to the model. Guidance, Outlines, and LMQL apply constraints at the token level, which means they need access to the model's next-token probabilities — easiest with open models you run yourself (via Transformers, llama.cpp, vLLM, and similar). Hosted APIs that only return finished text can't be steered token-by-token from outside, which is why closed providers ship their own built-in structured-output or grammar features instead. Both are forms of constrained generation; they just live in different places.
What makes Guidance distinct even among the token-level tools is the interleaving — you don't just hand it one big grammar for the whole output, you stitch fixed text and constrained slots together in program order, with real loops and conditionals in between. That is the "control flow in the prompt" idea, and it shines for multi-step interactions rather than a single structured blob. Compare it conceptually with structuring prompts using XML or markdown, which organizes the input; Guidance instead constrains the output.
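For contrast, the validate + retry row of the table might look like the sketch below. The model call is replaced by a canned iterator of invented replies, and `validate_and_retry` is a hypothetical helper, not part of any library.

```python
import json

def validate_and_retry(ask, required_keys, max_attempts=3):
    """Call `ask()` until it returns a JSON object with the required keys."""
    for attempt in range(1, max_attempts + 1):
        raw = ask()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # malformed: spend another call
        if isinstance(data, dict) and required_keys.issubset(data):
            return data, attempt
    raise RuntimeError(f"no valid reply after {max_attempts} attempts")

# Hypothetical stand-in for a model: two malformed replies, then a good one.
canned = iter([
    'Sure! Here is the JSON: {"intent": "cancel"}',
    '{"intent": "cancel",}',
    '{"intent": "cancel", "order": "4471"}',
])

data, attempts = validate_and_retry(lambda: next(canned), {"intent", "order"})
print(data, attempts)
```

It works, but each failure costs a full extra call, and the loop needs a cap to avoid spinning forever. Constrained decoding avoids both by never producing the malformed replies in the first place.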
Common pitfalls
Controlled generation removes a whole class of bugs, but it introduces a few of its own. The constraint is a hard wall, and walls in the wrong place cause their own problems.
- A constraint forces an answer, even a wrong one. If you force the output to be one of three labels, the model will emit one of those three, even when the true answer is "none of these" or "I'm not sure." Always include an escape hatch like an other or unknown option when the closed set might not cover reality.
- Over-tight regex fights the model. A pattern that's too rigid (say, a date regex that forbids a format the model strongly wants to use) can push it into awkward, low-probability tokens and hurt quality. Constrain the shape you truly need, not every cosmetic detail.
- The grammar doesn't make the answer correct. Guidance guarantees the output is well-formed, not that it is true. A perfectly-valid JSON object can still contain a hallucinated value. Structure and factual accuracy are separate problems.
- Token-level access is required. These constraints need to see and mask the model's logits. If your only access is a text-in/text-out hosted endpoint, external Guidance-style masking isn't available — you'd use that provider's own structured-output feature instead.
- Constraints don't replace prompting. A clear instruction still matters. The constraint narrows form; the prompt still has to make the model want the right content. Pair both.
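The gap between well-formed and usable is easy to show with the confidence pattern from the earlier example. In this small sketch, `check_score` is a hypothetical helper and the 0 to 100 range is assumed from the prompt label.

```python
import re

# Same shape constraint as the confidence slot: one to three digits.
SCORE_PATTERN = re.compile(r"[0-9]{1,3}")

def check_score(text):
    """Report separately whether text matches the pattern and whether it is in range."""
    well_formed = SCORE_PATTERN.fullmatch(text) is not None
    in_range = well_formed and 0 <= int(text) <= 100
    return well_formed, in_range

checks = {text: check_score(text) for text in ["87", "999", "high"]}
for text, (well_formed, in_range) in checks.items():
    print(f"{text!r}: well_formed={well_formed}, in_range={in_range}")
```

A constrained slot rules out "high", but "999" passes the pattern. And a perfectly in-range 87 can still be a number the model made up, so downstream checks remain your responsibility.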
Going deeper
Once the basic generate-and-constrain loop clicks, a few deeper ideas are worth knowing as you push controlled generation into real systems.
Grammars, not just regex. A regex constrains a single flat string, but many outputs are nested — JSON, a function call, a small query language. For these, Guidance lets you express a context-free grammar: a set of rules describing how valid structures nest. The decoder then masks tokens against the grammar at every step, so a generated JSON object always has balanced braces and quotes. This is how "guaranteed valid JSON" actually works under the hood: the grammar makes a malformed structure unreachable.
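To see why nesting needs more than a flat pattern, consider a toy grammar for nested lists of single digits: value := DIGIT | "[" [ value { "," value } ] "]". The sketch below is a hand-written state machine with a depth counter, invented for illustration. It is not how Guidance represents grammars.

```python
DIGITS = set("0123456789")

def legal(state):
    """Characters the toy grammar permits in each parser state."""
    return {
        "value": DIGITS | {"["},                 # a value must start here
        "value_or_close": DIGITS | {"[", "]"},   # just after "[": value or empty list
        "after_value": {",", "]"},               # inside a list, after a value
        "done": set(),                           # top-level value is complete
    }[state]

def allowed_next(prefix):
    """Legal next characters after `prefix`, tracking bracket depth as we go."""
    depth, state = 0, "value"
    for ch in prefix:
        if ch not in legal(state):
            raise ValueError(f"{ch!r} is not legal here")
        if ch == "[":
            depth += 1
            state = "value_or_close"
        elif ch == "]":
            depth -= 1
            state = "after_value" if depth else "done"
        elif ch == ",":
            state = "value"
        else:  # a digit
            state = "after_value" if depth else "done"
    return legal(state)

for prefix in ["", "[", "[1", "[1,", "[1,[2]", "[1,[2]]"]:
    print(repr(prefix), "->", "".join(sorted(allowed_next(prefix))) or "(complete)")
```

The depth counter is the part a regular expression cannot express for arbitrary nesting. A closing bracket is only offered while a list is open, so an unbalanced output is unreachable.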
Why it can be faster. Beyond skipping the fixed text you prefilled, a constrained decoder sometimes knows that only one token is possible next. For instance, when the grammar fixes an object's first key, everything from the opening brace through that key's closing quote is already determined. When a step is fully determined, the engine can fill it without a full sampling decision, which trims work. The practical headline: heavy templating often means fewer generated tokens, and fewer generated tokens means lower cost and latency.
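A back-of-the-envelope way to see forced steps is to count, for each label in a closed set, how many positions have exactly one legal continuation. This toy counts characters, whereas real engines count tokenizer tokens, so the numbers are illustrative only. `forced_positions` is a hypothetical helper.

```python
LABELS = ["positive", "negative", "neutral"]

def forced_positions(target, options):
    """Count characters of `target` that are the only legal continuation."""
    forced = 0
    for i in range(len(target)):
        prefix = target[:i]
        candidates = {opt[i] for opt in options
                      if opt.startswith(prefix) and len(opt) > i}
        if len(candidates) == 1:
            forced += 1            # no real decision: the engine can just emit it
    return forced

summary = {label: (forced_positions(label, LABELS), len(label)) for label in LABELS}
for label, (forced, length) in summary.items():
    print(f"{label}: {forced} of {length} characters are forced")
```

For this label set, most of each word is determined after one or two real decisions. That is the intuition behind constrained decoding sometimes doing less work than free generation.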
Stateful, multi-turn programs. Because the whole interaction is ordinary Python around constrained slots, you can build genuine programs: loop until a stopping condition the model itself signals, accumulate results across several gen() calls, or feed one slot's captured value into the constraint of the next. This blurs into agent territory — a program that decides what to ask the model next based on what it just answered, with each step's output shape guaranteed.
Where to go next. Constrained generation is the enforcement layer; clear prompting is still the foundation. Strengthen both by reading up on getting a specific output format from a prompt and on using constraints and negative instructions in prompts. The durable lesson: a plain prompt asks the model to behave, while controlled generation removes its ability to misbehave at the token level — so reach for it exactly where a broken shape would break your software.
FAQ
What is the Guidance library used for?
Guidance is a Python library for controlling exactly what an LLM generates. You interleave fixed text you write with constrained slots the model fills, optionally bounded by a choice list, a regex, or a grammar. It is mainly used for reliable structured output, classification, extraction, and multi-step prompts where the output shape must be guaranteed.
How is Guidance different from just asking the model to return JSON?
A plain "return JSON" prompt is a request the model can ignore or get subtly wrong. Guidance applies the constraint during decoding by masking any token that would break the format, so invalid output literally cannot be generated. You get a guarantee instead of a best effort, and the model can't wrap the data in extra prose.
What is the difference between Guidance and Outlines?
Both are constrained-decoding libraries that mask invalid tokens to force a valid shape, and both work best with open models you can run yourself. The main difference is style: Guidance emphasizes interleaving program control flow (loops, conditionals, fixed text, and slots) directly inside the prompt, while Outlines centers on generating output that matches a regex, JSON schema, or grammar. They solve overlapping problems.
Does Guidance work with hosted APIs like closed models?
Token-level constraints need access to the model's next-token probabilities, which is straightforward with open models run through backends like Transformers, llama.cpp, or vLLM. Purely hosted text-in/text-out endpoints can't be masked from outside, so closed providers offer their own built-in structured-output or grammar features instead. Check which backends your setup supports.
Does constrained generation stop the model from hallucinating?
No. Guidance guarantees the output is well-formed — valid JSON, a digit-only field, one of a fixed set of labels — but it does not guarantee the content is true. A perfectly structured answer can still contain a made-up value. Structure and factual accuracy are separate problems, so you still need verification for correctness.