AI/TLDR

What Are LLM Guardrails? Input and Output Validation for AI Apps

Understand what guardrails check before and after each model call, and how to add your first input and output validators.

BEGINNER · 12 MIN READ · UPDATED 2026-06-11

In plain English

LLM guardrails are the safety checks that run around a model call — one set on the way in, another on the way out. The guardrail on the input inspects what the user (or your own code) is about to send to the model. The guardrail on the output inspects what the model said before your app does anything with it. If either check fails, you block, fix, or retry instead of letting the bad thing through.

Here's the everyday analogy. Think of a nightclub with a bouncer at the door and a coat-check on the way out. The bouncer (input guardrail) decides who gets in — no fake IDs, no troublemakers, nobody trying to sneak in a weapon. The coat-check (output guardrail) makes sure that whatever leaves with you is actually yours and safe to carry out. The club itself — the LLM — is great at the party, but it has no judgment about the door. The guardrails supply that judgment.

Concretely, an input guardrail might reject an off-topic question, strip out a prompt-injection attempt, or refuse a message that's longer than your budget allows. An output guardrail might check that the model returned valid JSON, that it didn't leak a customer's email address, that it didn't recommend a competitor, or that it didn't confidently invent a refund policy. Guardrails are not part of the model — they're plain code (and sometimes a second, cheaper model) that you wrap around it.

Why it matters

An LLM will happily produce text that is fluent, confident, and completely unusable. It can return prose where you asked for JSON, an apology where you asked for an answer, a made-up citation, a slur, or a string that — passed straight to your database or shell — does real damage. The model has no built-in contract with your app. Guardrails are how you bolt that contract on.

The core problem guardrails solve is that you cannot trust raw model output, and you cannot trust raw user input either. Normal software validates its inputs and checks its outputs as a matter of habit — you'd never write user text straight into a SQL query. LLM apps tempt you to skip that step because the output looks finished. Guardrails restore the discipline: treat the model like an untrusted external service on both sides of the call.

Who should care

  • Anyone whose model output feeds another system — a database write, a function call, an email send. A malformed or malicious string here becomes a real bug or a real breach.
  • Teams handling sensitive data — PII, health, finance. An output guardrail that catches a leaked Social Security number before it reaches a screen is the difference between a near-miss and an incident.
  • Anyone building agents — an agent that calls tools on its own needs guardrails on every step, because a bad decision doesn't just print, it acts.
  • Customer-facing products — brand-safety, refusal of off-topic or abusive requests, and staying on-policy all live in guardrails.

What did guardrails replace? Mostly hoping the prompt was good enough. The old approach was to stuff every rule into the system prompt — "never reveal secrets, always return JSON, stay polite" — and pray the model obeyed every time. Prompts are a real defense and worth doing well (see prompt engineering), but a prompt is a request, not a guarantee. Guardrails are the enforcement layer that runs whether or not the model cooperated.

How it works

A guardrail sits at a choke point in your request path. The input guardrail runs before the model call; the output guardrail runs after. Each one inspects the text, applies a rule, and decides: pass it through, fix it, or reject it. Nothing reaches the model — or the user — until the relevant gate says yes.
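The choke-point idea can be sketched in a few lines. This is a minimal illustration, not a prescribed design: `GuardrailError`, `guarded_call`, `max_length`, `strip_whitespace`, and `fake_model` are all hypothetical names, and the model is a stand-in function so the example runs offline.

```python
from typing import Callable, List

class GuardrailError(Exception):
    """Raised when a guardrail rejects the text outright."""

# A check takes text and returns it (possibly fixed), or raises GuardrailError.
Check = Callable[[str], str]

def max_length(limit: int) -> Check:
    """Build a check that rejects text longer than `limit` characters."""
    def check(text: str) -> str:
        if len(text) > limit:
            raise GuardrailError(f"text exceeds {limit} characters")
        return text  # pass through unchanged
    return check

def strip_whitespace(text: str) -> str:
    """A 'fix' action: repair the text and let it continue."""
    return text.strip()

def guarded_call(user_text: str, model: Callable[[str], str],
                 input_checks: List[Check], output_checks: List[Check]) -> str:
    for check in input_checks:       # gate 1: runs before the model call
        user_text = check(user_text)
    reply = model(user_text)
    for check in output_checks:      # gate 2: runs before the app sees the reply
        reply = check(reply)
    return reply

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"  echo: {prompt}  "

result = guarded_call("  hello  ", fake_model,
                      [strip_whitespace, max_length(50)], [strip_whitespace])
print(result)  # echo: hello
```

Each check either returns text (pass or fix) or raises (reject), so the three outcomes described above map onto ordinary return values and exceptions.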

Guardrails come in a few flavors depending on how they make the decision. Some are simple deterministic code; some call a model or an ML classifier to judge fuzzier things. You mix them based on what you're checking:

Type         | How it decides                   | Catches
-------------|----------------------------------|----------------------------------------------------
Rule / regex | Pattern match in plain code      | Email or card numbers, banned words, max length
Schema       | Parse against a typed shape      | Broken JSON, missing fields, wrong types, bad ranges
Classifier   | A small ML model scores the text | Toxicity, off-topic, prompt injection, sentiment
LLM-as-judge | A second model grades the first  | Faithfulness to sources, tone, policy nuance

The deterministic ones (rule and schema) are fast, effectively free, and fully predictable, which makes them your first line of defense. The model-based ones (classifier and LLM-as-judge) handle the fuzzy stuff a regex can't, at the cost of extra latency and money. A good guardrail stack uses cheap deterministic checks first and only reaches for a model when the question genuinely needs judgment.
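As a rough illustration of the rule/regex row, here is a small deterministic checker. The patterns are deliberately loose sketches rather than production-grade PII detectors, and `rule_check`, `BANNED`, and the `acmecorp` term are hypothetical.

```python
import re
from typing import List

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by spaces or hyphens (loose on purpose)
CARD = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")
BANNED = {"acmecorp"}   # hypothetical banned term, e.g. a competitor name
MAX_CHARS = 2000

def rule_check(text: str) -> List[str]:
    """Return the names of every rule the text violates (empty list = pass)."""
    problems = []
    if len(text) > MAX_CHARS:
        problems.append("too_long")
    if EMAIL.search(text):
        problems.append("email")
    if CARD.search(text):
        problems.append("card_number")
    lowered = text.lower()
    if any(word in lowered for word in BANNED):
        problems.append("banned_word")
    return problems

print(rule_check("Contact me at jane@example.com"))  # ['email']
print(rule_check("Where is my order?"))              # []
```

Because this is plain pattern matching, it makes no network calls and returns the same answer for the same input every time.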

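When a guardrail fails, it has to do something. The common failure actions form a ladder from gentle to hard: retry the call, fix the text (for example by redacting it) and log the event, or block the request outright.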

Which action you pick depends on the stakes. A broken-JSON output is usually worth a silent retry. A leaked secret should be redacted and logged. An abusive input should be blocked outright. The art of guardrails is choosing the lightest action that still keeps the bad thing from reaching anyone who matters.
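One way to encode the retry, redact, and block choices is a small policy table. This is a sketch under assumptions: the problem names, the `POLICY` mapping, and `handle_failure` are hypothetical, and unknown problems fall through to the hardest action (fail closed).

```python
import re
from enum import Enum
from typing import Optional, Tuple

class Action(Enum):
    RETRY = "retry"    # gentle: ask the model again
    REDACT = "redact"  # middle: fix the text and continue
    BLOCK = "block"    # hard: refuse outright

# Hypothetical mapping from detected problem to failure action
POLICY = {
    "invalid_json": Action.RETRY,
    "email": Action.REDACT,
    "abuse": Action.BLOCK,
}

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def handle_failure(problem: str, text: str) -> Tuple[Action, Optional[str]]:
    """Return the chosen action and the text to use (None means 'call the model again')."""
    action = POLICY.get(problem, Action.BLOCK)  # unknown problems fail closed
    if action is Action.RETRY:
        return action, None
    if action is Action.REDACT:
        return action, EMAIL.sub("[REDACTED]", text)
    return action, "Sorry, I can't help with that request."

print(handle_failure("email", "Reach jane@example.com now"))
```

Keeping the mapping in data rather than in scattered `if` statements makes it easy to review which stakes got which action.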

A minimal example

You don't need a framework to start. You need two small functions wrapped around your model call. Here's the smallest useful pair: an input guardrail that rejects over-long messages and an output guardrail that forces the answer into a typed schema, retrying once if the model returns garbage. The schema check uses Pydantic, a widely used data-validation library for Python, and the same approach underlies many structured-output helpers.

guardrails.py (python)
import json
from pydantic import BaseModel, ValidationError, field_validator
from anthropic import Anthropic

client = Anthropic(api_key="sk-...")  # placeholder

# --- Output schema: the contract the model MUST satisfy ---
class Refund(BaseModel):
    approved: bool
    amount_usd: float

    @field_validator("amount_usd")
    @classmethod
    def non_negative(cls, v: float) -> float:
        if v < 0:
            raise ValueError("amount cannot be negative")
        return v

# --- Input guardrail: cheap, deterministic, runs first ---
def check_input(text: str) -> str:
    if len(text) > 2000:
        raise ValueError("input too long")  # block before spending a token
    return text

# --- Output guardrail: parse + validate, retry once on failure ---
def ask_refund(question: str, retries: int = 1) -> Refund:
    check_input(question)
    prompt = (
        f"{question}\n\n"
        'Reply ONLY with JSON: {"approved": bool, "amount_usd": number}'
    )
    for attempt in range(retries + 1):
        msg = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}],
        )
        try:
            # json.loads may return a list or string; ** then raises TypeError
            return Refund(**json.loads(msg.content[0].text))
        except (json.JSONDecodeError, ValidationError, TypeError) as e:
            # Feed the error back so the retry can self-correct
            prompt += f"\n\nYour last reply was invalid: {e}. Try again."
    raise ValueError("model never produced valid output")

print(ask_refund("Customer wants $20 back for a late order. Approve it."))

That's a complete guardrail loop in roughly 45 lines. The input check blocks an over-long payload before you pay for a single token. The output check guarantees that whatever your app receives is a real Refund object with a non-negative amount: never a freeform paragraph, never a negative number. And the retry feeds the validation error back to the model, which is often enough for it to fix its own mistake on the second try.
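To see the retry path in action without a network call, the same loop can be exercised against a scripted fake model. This sketch swaps Pydantic for a hand-written standard-library validator so it runs anywhere; `parse_refund`, `ask_with_retry`, and `fake_model` are hypothetical stand-ins, not part of the example above.

```python
import json
from typing import Callable

def parse_refund(raw: str) -> dict:
    """Stdlib stand-in for the schema check: raises ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("approved"), bool):
        raise ValueError("approved must be a boolean")
    amount = data.get("amount_usd")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("amount_usd must be a number")
    if amount < 0:
        raise ValueError("amount cannot be negative")
    return {"approved": data["approved"], "amount_usd": float(amount)}

def ask_with_retry(call_model: Callable[[str], str], prompt: str, retries: int = 1) -> dict:
    for _ in range(retries + 1):
        raw = call_model(prompt)
        try:
            return parse_refund(raw)
        except ValueError as e:
            # Feed the error back so the next attempt can self-correct
            prompt += f"\n\nYour last reply was invalid: {e}. Try again."
    raise ValueError("model never produced valid output")

# Scripted fake model: prose on the first call, valid JSON on the second
replies = iter(["Sure! I approve it.", '{"approved": true, "amount_usd": 20}'])
seen_prompts = []

def fake_model(prompt: str) -> str:
    seen_prompts.append(prompt)
    return next(replies)

refund = ask_with_retry(fake_model, "Customer wants $20 back.")
print(refund)             # {'approved': True, 'amount_usd': 20.0}
print(len(seen_prompts))  # 2
```

Passing the model in as a function is a design choice that makes the guardrail testable: the second prompt the fake model sees contains the validation error from the first attempt.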

Guardrails vs evals

Beginners constantly confuse guardrails with evals, because both are about quality. The difference is when they run and what they protect. Guardrails run live, on every single request, and protect the current user from a single bad response. Evals run offline, on a fixed test set, and tell you whether your app as a whole got better or worse after a change.

They work as a team. An eval might reveal that 4% of your answers leak PII; that finding tells you to add a PII guardrail. Later, a guardrail blocks a real leak in production; you turn that real example into a new eval case so a future change can't reintroduce it. Guardrails catch the live miss, evals prove the fix generalizes, and the loop feeds testing and observability alike. Neither replaces the other.
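The loop between the two can be sketched with a toy app. Everything here is illustrative: `leaky_app` is a deliberately flawed stand-in, `run_eval` measures a leak rate over a fixed prompt list, and `guarded` shows a live guardrail recording a real failure as a new eval case.

```python
import re
from typing import Callable, List

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def leaks_pii(text: str) -> bool:
    return bool(EMAIL.search(text))

def leaky_app(prompt: str) -> str:
    """Toy app that leaks an address whenever the prompt mentions 'contact'."""
    if "contact" in prompt.lower():
        return "Your contact is jane@example.com"
    return "Your order has shipped."

def run_eval(app: Callable[[str], str], prompts: List[str]) -> float:
    """Offline eval: fraction of test prompts whose output leaks PII."""
    outputs = [app(p) for p in prompts]
    return sum(leaks_pii(o) for o in outputs) / len(outputs)

def guarded(app: Callable[[str], str], prompt: str, eval_prompts: List[str]) -> str:
    """Live guardrail: withhold the leak and save the prompt as a regression case."""
    output = app(prompt)
    if leaks_pii(output):
        eval_prompts.append(prompt)
        return "[response withheld]"
    return output

eval_prompts = ["Where is my order?", "Who is my contact?",
                "What is your refund policy?", "Cancel my order"]
print(run_eval(leaky_app, eval_prompts))  # 0.25
print(guarded(leaky_app, "Give me a contact email", eval_prompts))
print(len(eval_prompts))                  # 5
```

The eval produces an aggregate number you compare across releases, while the guardrail acts on one request at a time.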

The tool landscape

You can hand-roll guardrails like the example above, and for many apps that's the right call — a few regexes and a Pydantic schema cover most of what you need. When you want pre-built validators for harder problems (toxicity, PII detection, prompt-injection scoring), a couple of open-source frameworks package those up:

  • Guardrails AI — a Python framework with a Hub of ready-made input/output validators you compose into guards. Strong at structured-output enforcement and risk detection; runs as a library or a standalone server.
  • NVIDIA NeMo Guardrails — a toolkit focused on conversational systems, with a small modeling language (Colang) for defining what the bot may and may not do. Splits rails into input, dialog, retrieval, execution, and output stages.
  • Pydantic — not a guardrail framework, but the workhorse for the schema half of the job. If your only guardrail is "the output must match this typed shape," Pydantic alone gets you there.
  • Provider-side filters — most model APIs ship their own safety classifiers and refusal behavior. Treat these as a baseline, not your whole defense — they don't know your app's policies.

Going deeper

Once basic input and output checks are in place, a harder set of guardrail problems shows up — the ones that separate a demo-grade safeguard from a production one.

Guardrails on streaming output

The clean retry pattern above assumes you have the whole answer before you check it. But many apps stream tokens to the user as they arrive, which means a banned word or a leaked secret may already be on screen before your output guardrail ever sees the full text. Production systems handle this with partial-output checks: validating chunks as they stream, buffering risky spans, or running a fast classifier on a short trailing window. It's a genuine tension — streaming feels instant, but instant output is output you haven't validated yet.
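A trailing-window check might look like the following sketch. It assumes a hypothetical fixed-length secret format (`sk-` plus 16 alphanumerics) and holds back the last few characters of the stream so that a secret split across chunk boundaries is caught before any part of it is released; `guarded_stream` and `HOLD_BACK` are illustrative names.

```python
import re
from typing import Iterable, Iterator

# Hypothetical secret format: "sk-" followed by exactly 16 alphanumerics
SECRET = re.compile(r"sk-[A-Za-z0-9]{16}")
HOLD_BACK = 32  # must be at least the length of the longest span to catch (19 here)

def guarded_stream(chunks: Iterable[str], hold_back: int = HOLD_BACK) -> Iterator[str]:
    """Yield text to the user while keeping the last `hold_back` chars unreleased."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        buffer = SECRET.sub("[REDACTED]", buffer)  # check before releasing anything
        if len(buffer) > hold_back:
            yield buffer[:-hold_back]              # safe: already scanned
            buffer = buffer[-hold_back:]           # tail may hold a partial secret
    if buffer:
        yield buffer                               # stream ended, flush the tail

chunks = ["Your key is sk-ABCD", "EFGH12345678", " keep it safe. ", "Bye."]
pieces = list(guarded_stream(chunks))
print("".join(pieces))  # Your key is [REDACTED] keep it safe. Bye.
```

The cost is visible in the code: the user always sees text up to `hold_back` characters late, which is the streaming tension described above in miniature.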

Guardrails for agents and tool calls

A chatbot's worst output is bad text. An agent that can run code, send emails, or move money has a worse worst case: a bad action. So agent guardrails wrap the tool call, not just the text — allow-lists of permitted tools, argument validation before execution, spend or rate caps, and human-in-the-loop approval for anything irreversible. The guardrail moves from "is this string safe to show?" to "is this action safe to take?", which is a higher bar.
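The four controls named above (allow-list, argument validation, spend cap, human approval) can be combined in one pre-execution check. This is a minimal sketch: `ToolPolicy`, `check_tool_call`, and the tool names `lookup_order` and `send_refund` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set

class ToolBlocked(Exception):
    """Raised when a tool call is not allowed to execute."""

@dataclass
class ToolPolicy:
    allowed_tools: Set[str]
    spend_cap_usd: float
    needs_approval: Set[str] = field(default_factory=set)
    spent_usd: float = 0.0

def check_tool_call(policy: ToolPolicy, tool: str, args: dict,
                    approved: bool = False) -> None:
    """Raise ToolBlocked unless the call passes every rule; record spend on success."""
    if tool not in policy.allowed_tools:
        raise ToolBlocked(f"tool {tool!r} is not allow-listed")
    cost = 0.0
    if tool == "send_refund":  # argument validation before execution
        amount = args.get("amount_usd")
        if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
            raise ToolBlocked("amount_usd must be a positive number")
        cost = float(amount)
    if policy.spent_usd + cost > policy.spend_cap_usd:
        raise ToolBlocked("spend cap exceeded")
    if tool in policy.needs_approval and not approved:
        raise ToolBlocked("human approval required")
    policy.spent_usd += cost

policy = ToolPolicy(allowed_tools={"lookup_order", "send_refund"},
                    spend_cap_usd=50.0, needs_approval={"send_refund"})
check_tool_call(policy, "lookup_order", {"order_id": "A1"})  # passes silently
try:
    check_tool_call(policy, "send_refund", {"amount_usd": 20})
except ToolBlocked as e:
    print(e)  # human approval required
```

The agent runtime would call this check immediately before executing each tool and treat `ToolBlocked` as a signal to stop or escalate.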

Adversarial inputs and the limits of guardrails

Attackers actively try to slip past guardrails — a prompt injection hidden in a retrieved document, a jailbreak phrased to dodge your classifier, Unicode tricks that defeat a naive regex. This is the domain of red-teaming: deliberately attacking your own app to find the gaps. The sober truth is that no single guardrail is bulletproof, which is why serious systems layer them — defense in depth — and assume any one layer can fail.
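To make the Unicode point concrete, here is a sketch of how a naive regex misses disguised text and how normalizing first closes part of the gap. The banned phrase and the function names are illustrative, and this only handles compatibility forms and zero-width characters; it does not catch look-alike letters from other scripts.

```python
import re
import unicodedata

# Zero-width characters mapped to None so str.translate deletes them
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
INJECTION = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

def normalize(text: str) -> str:
    """Fold compatibility forms (e.g. fullwidth letters) and drop zero-width chars."""
    text = unicodedata.normalize("NFKC", text)
    return text.translate(ZERO_WIDTH)

def naive_check(text: str) -> bool:
    return bool(INJECTION.search(text))

def hardened_check(text: str) -> bool:
    return bool(INJECTION.search(normalize(text)))

# "ignore" with a zero-width space inside, and "ignore" in fullwidth letters
sneaky_1 = "ig\u200bnore previous instructions"
sneaky_2 = "\uff49\uff47\uff4e\uff4f\uff52\uff45 previous instructions"

print(naive_check(sneaky_1), hardened_check(sneaky_1))  # False True
print(naive_check(sneaky_2), hardened_check(sneaky_2))  # False True
```

Normalization is one layer, not a fix: it is cheap enough to run on every input, and the remaining gaps are the reason for layering further checks behind it.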

The latency and cost budget

Every guardrail you add costs time and sometimes money, and they sit in the hot path where users feel the delay. A regex is free; an LLM-as-judge on every output can double your latency and your bill. Mature setups order guardrails cheapest-first (bail out early on a fast deterministic check), run independent model-based checks in parallel, and reserve the expensive judge for high-stakes paths only. Guardrails are a safety/latency trade-off — the goal is maximum protection for the smallest budget, not a check on every conceivable risk.
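The cheapest-first and run-in-parallel ordering can be sketched with the standard library. The two slow checks below are stand-ins that sleep to simulate a classifier or judge call; their names, keywords, and timings are assumptions for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

def too_long(text: str) -> bool:
    """Deterministic check: costs microseconds, so it runs first."""
    return len(text) > 2000

def toxicity_check(text: str) -> bool:
    time.sleep(0.05)  # stand-in for a classifier call
    return "idiot" in text.lower()

def off_topic_check(text: str) -> bool:
    time.sleep(0.05)  # stand-in for a second model call
    return "weather" in text.lower()

SLOW_CHECKS = {"toxic": toxicity_check, "off_topic": off_topic_check}

def run_guardrails(text: str) -> List[str]:
    """Return the names of failed checks, bailing out early on the cheap one."""
    if too_long(text):
        return ["too_long"]  # no model-based check is ever paid for
    # Independent slow checks run concurrently: total wait is about the slowest one
    with ThreadPoolExecutor(max_workers=len(SLOW_CHECKS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in SLOW_CHECKS.items()}
        return [name for name, fut in futures.items() if fut.result()]

print(run_guardrails("x" * 3000))                   # ['too_long']
print(run_guardrails("What's the weather, idiot?")) # ['toxic', 'off_topic']
print(run_guardrails("Where is my refund?"))        # []
```

Threads suit this case because the slow checks spend their time waiting on I/O; an async client would achieve the same effect with `asyncio.gather`.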

FAQ

What are LLM guardrails in simple terms?

LLM guardrails are safety checks that run before and after a model call. The input guardrail inspects what you send to the model; the output guardrail inspects what the model returns before your app uses it. If a check fails, you block, fix, or retry instead of letting the bad input or output through.

What's the difference between input and output guardrails?

Input guardrails run before the model call and catch bad or unsafe requests — over-long messages, off-topic questions, prompt-injection attempts. Output guardrails run after the call and catch bad responses — broken JSON, leaked PII, off-policy or toxic text. Most production apps use both.

How do I add guardrails to my LLM app?

Start small and hand-rolled: a function that validates input (length, allowed topics) before the call, and a schema check (e.g. with Pydantic) on the output after the call, retrying once with the error fed back if it fails. Add a framework like Guardrails AI or NeMo Guardrails only when you need pre-built validators for things like toxicity or PII.

Are guardrails the same as prompt engineering?

No. A prompt is a request to the model — it asks for good behavior but can't guarantee it. A guardrail is enforcement code that runs whether or not the model obeyed the prompt. Good apps use both: a strong prompt to encourage the right output, and a guardrail to verify it actually happened.

What's the difference between guardrails and evals?

Guardrails run live on every request and protect the current user from one bad response. Evals run offline on a fixed test set and measure whether your app as a whole got better or worse. They feed each other — a guardrail failure in production becomes a new eval case so the bug can't return.

Do guardrails slow down my app?

Deterministic guardrails like regex and schema checks are effectively free. Model-based ones — a toxicity classifier or an LLM-as-judge — add real latency and cost because they make an extra call. The fix is ordering: run cheap checks first, run model-based checks in parallel, and reserve the expensive ones for high-stakes paths.

Further reading