AI/TLDR

What Does a Production LLM Stack Look Like? A Reference Architecture

See the full layered architecture of a production LLM app and understand what each layer is for and when to add it.

Beginner · 10 min read · Updated 2026-06-12

In plain English

A production LLM stack is all the infrastructure that sits around your language model so it can reliably serve real users at scale. The model itself — GPT-4o, Claude, Llama, or any other — is actually a small part of the picture. What surrounds it does most of the heavy lifting.

Think of it like a commercial kitchen. The chef (the LLM) is the talent, but without the pass-through window (gateway), the ticket system (observability), the refrigerators (cache), the health inspector (guardrails), and the recipe testing process (evals), you cannot run a restaurant. You can have the world's best chef and still produce an inconsistent, expensive, unreliable dining experience if the kitchen infrastructure is missing.

On day one of a side project, you need almost none of this: call the model API directly, ship the feature, see if anyone cares. But the moment you have real users, real costs, or real consequences for a bad output, each layer of the production stack starts earning its place.

Why it matters

Moving from a prototype to production exposes a cluster of problems that do not exist in demos. Understanding the full stack helps you anticipate those problems before they hit users.

Cost surprise

LLM API costs scale with token volume, and token volume grows faster than request counts as features are added: longer system prompts, retrieval context, and multi-turn history all inflate every request. Without a caching layer and per-request cost tracking, a single viral day can generate an unexpected four-figure bill. Semantic caching alone can eliminate roughly 15–30% of API calls for typical conversational workloads.
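To make the effect of prompt growth concrete, here is a minimal cost estimate for a single request. The per-million-token prices, token counts, and daily request volume are all assumptions chosen for illustration, not any provider's real rates.

```python
# Hypothetical prices in USD per million tokens; not any provider's real rates.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of one LLM call."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# A bare prototype prompt versus the same feature after adding a longer
# system prompt, retrieved context, and chat history (illustrative sizes).
prototype = request_cost(input_tokens=200, output_tokens=300)
production = request_cost(input_tokens=200 + 1_500 + 4_000 + 2_000, output_tokens=300)

daily_requests = 50_000  # assumed traffic
print(f"prototype:  ${prototype * daily_requests:,.2f}/day")
print(f"production: ${production * daily_requests:,.2f}/day")
```

Under these assumed numbers the output length never changes, yet the per-request cost rises several-fold purely from input tokens, which is why per-request cost tracking matters.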

Reliability gaps

Provider APIs go down. Rate limits get hit. Model versions get deprecated. Without a gateway that handles failover and retry logic, any of these events becomes a user-facing outage. The gateway is the reliability buffer between your app and the unpredictable availability of hosted model providers.

Silent quality degradation

Prompts drift. Model providers silently update their underlying models. A change in retrieval logic can degrade answer quality without raising an error. Without continuous evaluation, you discover quality regressions from user complaints — which is always too late.

Safety and compliance

In any customer-facing product, the LLM can be prompted by adversarial users. Guardrails intercept prompt injections, block policy-violating outputs, and strip PII before it reaches the model — work that provider-level content filters do not fully cover.

How it works

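A mature production LLM stack has six distinct layers. The first four (gateway, cache, guardrails, and retrieval) sit on the request path between your app and the model; observability records each request alongside that path, and evals typically run offline against recorded or labeled data. Each layer can be a hosted service, an open-source library, or a thin wrapper you write yourself.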

Layer 1: The LLM Gateway

The gateway is a single API endpoint your application calls instead of calling model providers directly. It handles authentication (your app never touches raw provider API keys), rate limiting, per-customer spend budgets, intelligent routing (send cheap tasks to a small model, complex ones to a large model), and automatic failover when a provider goes down. Purpose-built LLM gateways like Portkey, LiteLLM, and OpenRouter typically add only 2–20ms of latency, which is negligible against 500–2000ms LLM response times.
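The failover behavior described above can be sketched in a few lines. This is a simplified illustration, not how any particular gateway is implemented: the `Provider` callables, `call_with_failover`, and the two demo providers are hypothetical names, and a real gateway would wrap provider SDKs and catch narrower error types.

```python
import time
from typing import Callable, Sequence

# Each provider is a plain callable taking a prompt and returning text.
# Real gateways wrap provider SDKs; these names are hypothetical.
Provider = Callable[[str], str]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has been exhausted."""


def call_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Provider]],
    retries_per_provider: int = 2,
    backoff_seconds: float = 0.0,
) -> tuple[str, str]:
    """Try providers in order, retrying each before moving to the next.

    Returns (provider_name, response_text).
    """
    errors: list[str] = []
    for name, provider in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, provider(prompt)
            except Exception as exc:  # a real gateway would catch narrower errors
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    raise AllProvidersFailed("; ".join(errors))


def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary provider timed out")


def healthy_fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


used, answer = call_with_failover(
    "Summarize our refund policy.",
    [("primary", flaky_primary), ("fallback", healthy_fallback)],
)
print(used, "->", answer)
```

The application only ever calls `call_with_failover`, so swapping or reordering providers does not touch application code.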

Layer 2: The Cache Layer

There are two distinct caching mechanisms, and they operate at different levels:

  • Semantic cache: compares the incoming prompt against past prompts using vector similarity. If a new request is semantically equivalent to a cached one (above a cosine-similarity threshold), the stored response is returned in 3–8ms instead of making a full LLM call. Studies of production workloads show 15–31% of queries qualify for semantic cache hits.
  • Provider prefix cache: operates inside the model provider's inference infrastructure. Long, repeated prefixes (system prompts, document context) are kept in the provider's KV cache across requests. Anthropic's prompt caching delivers up to 90% cost reduction and up to 85% latency reduction for long prompts; OpenAI's automatic caching discounts cached input tokens by 50%, with the exact discount depending on the model.
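The semantic cache lookup can be sketched as follows. The `toy_embed` function is a bag-of-words stand-in, an assumption made only to keep the example runnable; a real semantic cache would call an embedding model and store vectors in a vector index. The class name and the 0.6 threshold are likewise illustrative.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector.

    This is an assumption to keep the example runnable; a real semantic cache
    would call an embedding model here.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Counter], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        """Return the best cached response at or above the threshold, else None."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))


cache = SemanticCache(toy_embed, threshold=0.6)  # threshold tuned for the toy embedding
cache.put("What is the capital of France?", "Paris")

hit = cache.get("Tell me the capital city of France")
miss = cache.get("How do I reset my password?")
print(hit, miss)
```

The threshold is the key design choice: set it too low and different questions share an answer, set it too high and paraphrases miss. Word-count vectors in particular would confuse prompts that differ by a single important word, which is one reason production caches rely on learned embeddings.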

Layer 3: Guardrails

Guardrails run on every request, checking both the user input (input rails) and the model's response (output rails). Common open-source tools include NVIDIA's NeMo Guardrails (a toolkit that uses the Colang DSL to define conversational policies), Llama Guard 4 (a 12B-parameter safety classifier from Meta for content moderation on both text and images), and Guardrails AI (an output validation and correction library). Many teams also run Microsoft Presidio at ingress to detect and strip PII before it reaches the model.
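As a rough illustration of the input-rail and output-rail split, the sketch below redacts email addresses and rejects obvious injection phrases on the way in, then validates a JSON response on the way out. It deliberately uses only the standard library: the regex, the phrase list, and the function names are simplistic assumptions, and the tools named above do this far more thoroughly.

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Deliberately naive phrase list; real injection detection needs a classifier.
INJECTION_PHRASES = ("ignore previous instructions", "reveal your system prompt")


class Blocked(Exception):
    """Raised when a rail rejects the input or output."""


def input_rail(user_text: str) -> str:
    """Reject obvious injection attempts, then redact email addresses."""
    lowered = user_text.lower()
    if any(phrase in lowered for phrase in INJECTION_PHRASES):
        raise Blocked("possible prompt injection")
    return EMAIL.sub("<EMAIL>", user_text)


def output_rail(model_text: str, required_keys: set[str]) -> dict:
    """Require the model output to be a JSON object with the expected keys."""
    try:
        data = json.loads(model_text)
    except json.JSONDecodeError as exc:
        raise Blocked("output is not valid JSON") from exc
    if not isinstance(data, dict) or not required_keys.issubset(data):
        raise Blocked("output is missing required keys")
    return data


safe_prompt = input_rail("My email is jane.doe@example.com, where is my order?")
parsed = output_rail('{"status": "shipped", "eta_days": 2}', {"status", "eta_days"})
print(safe_prompt)
print(parsed)
```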

Layer 4: Retrieval and Memory

Most production apps need context the base model does not have: company documents, conversation history, user data, tool results. This layer handles the retrieval-augmented generation (RAG) pipeline — embedding queries, searching a vector store, and assembling the retrieved context into the prompt. Conversation memory management also lives here: deciding how much history to retain and how to compress it as context windows fill up.
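The retrieve-then-assemble flow, plus a simple history budget, might look like the sketch below. Word overlap stands in for embedding search and a word count stands in for a token count; both are assumptions to keep the example dependency-free, and all function names are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (toy stand-in for vector search)."""
    q = tokens(query)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def trim_history(history: list[str], max_words: int) -> list[str]:
    """Keep the most recent turns that fit in a word budget (proxy for a token budget)."""
    kept: list[str] = []
    used = 0
    for turn in reversed(history):
        words = len(turn.split())
        if used + words > max_words:
            break
        kept.append(turn)
        used += words
    return list(reversed(kept))


def build_prompt(query: str, documents: list[str], history: list[str],
                 max_history_words: int = 12) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    recent = "\n".join(trim_history(history, max_history_words))
    return f"Context:\n{context}\n\nHistory:\n{recent}\n\nQuestion: {query}"


docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
history = [
    "user: hi there",
    "assistant: hello, how can I help?",
    "user: I bought a lamp last week",
]
print(build_prompt("How long do refunds take?", docs, history))
```

Dropping the oldest turns is the simplest memory policy; summarizing older turns instead is a common refinement once conversations get long.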

Layer 5: Observability

Every request generates a trace: the full prompt, model response, latency, token counts, and cost. Aggregate traces become dashboards — P50/P99 latency, error rates, spend per user, and regression alerts. Tools like Langfuse (open-source, self-hostable, 50K free events/month on cloud), LangSmith (tight LangChain integration, 5K free traces/month), and Helicone (proxy-based, zero-SDK setup, 100K free requests/month) all serve this role.
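To show how individual traces roll up into dashboard numbers, here is a minimal sketch. The `Trace` fields mirror the ones listed above, but the record shape, the nearest-rank percentile choice, and the sample values are assumptions rather than any tool's actual schema.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    user_id: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    error: bool = False


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces: list[Trace]) -> dict:
    """Aggregate raw traces into latency, error-rate, and spend metrics."""
    latencies = [t.latency_ms for t in traces]
    spend: dict[str, float] = {}
    for t in traces:
        spend[t.user_id] = spend.get(t.user_id, 0.0) + t.cost_usd
    return {
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "error_rate": sum(t.error for t in traces) / len(traces),
        "spend_per_user": spend,
    }


traces = [
    Trace("alice", 620, 900, 150, 0.004),
    Trace("alice", 710, 1200, 200, 0.005),
    Trace("bob", 1900, 3000, 400, 0.012, error=True),
    Trace("bob", 540, 800, 120, 0.003),
]
summary = summarize(traces)
print(summary)
```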

Layer 6: Evals

Evals are the automated test suite for your LLM feature. They run on a labeled dataset of prompt-response pairs and measure correctness, tone, safety, and task-specific quality. Evals catch quality regressions when you change a model, update a prompt, or modify retrieval logic, which is the same role unit tests play for regular code. Frameworks like Braintrust, DeepEval, and the built-in eval tooling in LangSmith all provide structured pipelines for running and scoring eval suites.
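A bare-bones eval loop is small enough to write by hand before adopting a framework. In this sketch the dataset, the `fake_model` stand-in, and the substring scorer are all assumptions; real suites usually combine several scorers, often including model-graded ones.

```python
from typing import Callable

# A tiny labeled dataset; real suites are larger and task-specific.
DATASET = [
    {"prompt": "Capital of France?", "must_contain": "paris"},
    {"prompt": "2 + 2 = ?", "must_contain": "4"},
    {"prompt": "Opposite of hot?", "must_contain": "cold"},
]


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    canned = {
        "Capital of France?": "The capital is Paris.",
        "2 + 2 = ?": "4",
        "Opposite of hot?": "Warm.",
    }
    return canned[prompt]


def run_evals(model: Callable[[str], str], dataset: list[dict]) -> dict:
    """Score each case with a substring check and report the pass rate."""
    failures = []
    for case in dataset:
        output = model(case["prompt"])
        if case["must_contain"] not in output.lower():
            failures.append({"prompt": case["prompt"], "output": output})
    passed = len(dataset) - len(failures)
    return {"pass_rate": passed / len(dataset), "failures": failures}


report = run_evals(fake_model, DATASET)
print(report)
```

Keeping the failing cases in the report, not just the score, is what makes a regression debuggable.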

When to add each layer

Not every team needs the full stack on day one. The right time to add each layer is when the pain it solves becomes concrete.

| Layer | Add it when... | Skip for now if... |
| --- | --- | --- |
| Gateway | You use more than one model provider, or a provider outage would hurt your business | Single provider, low traffic, outage is tolerable |
| Semantic cache | Your workload has many similar or repeated prompts (support bots, FAQ assistants) | Every prompt is unique (creative writing, code generation) |
| Provider prefix cache | You have a long system prompt or large document context repeated across calls | Short prompts with no repeated prefix |
| Guardrails | You have a public-facing product or handle sensitive data | Internal tool with trusted users only |
| Retrieval / RAG | The model needs facts it was not trained on (your docs, recent data) | General-purpose chat where base model knowledge is enough |
| Observability | You want to debug quality issues or track cost by customer | Local dev / early prototype stage |
| Evals | You are about to change the model, prompt, or retrieval logic | Still in initial prompt iteration |

Common tools at each layer

The LLM tooling ecosystem is young and moves fast. The table below shows the most widely adopted options as of mid-2026, with brief notes on their positioning.

| Layer | Open-source / self-host | Hosted / managed |
| --- | --- | --- |
| Gateway | LiteLLM (Python, 100+ provider integrations) | Portkey, OpenRouter, Requesty |
| Semantic cache | GPTCache, Redis with vector index | Portkey cache, LiteLLM cache |
| Guardrails | NeMo Guardrails, Guardrails AI, Presidio, Llama Guard 4 | Lakera Guard, Galileo Luna guardrails |
| Vector store / RAG | Chroma, Qdrant, pgvector, Weaviate | Pinecone, Weaviate Cloud, MongoDB Atlas Vector |
| Observability | Langfuse, Phoenix (Arize) | LangSmith, Helicone, Braintrust |
| Evals | DeepEval, RAGAS, promptfoo | Braintrust, LangSmith Evals, Confident AI |

The gateway-as-control-plane pattern

A popular architectural pattern is to treat the gateway as the central control plane for everything except tracing. Routing, caching, budget enforcement, and basic guardrails all flow through one gateway process. Observability is wired as a side-channel (webhook, SDK callback, or proxy mirror) so it does not add latency to the critical path. Evals run offline in CI/CD, not in the hot path.
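One way to picture the side-channel idea is a request handler that enqueues a trace event and returns immediately, while a background worker drains the queue. This is a sketch under stated assumptions: the exact-match dictionary cache, the `echo_model` stand-in, and the in-memory `recorded` list are placeholders for a real cache, provider call, and observability backend.

```python
import queue
import threading

trace_queue: queue.Queue = queue.Queue()
recorded: list[dict] = []


def trace_worker() -> None:
    """Drain trace events in the background so the request path never waits on them."""
    while True:
        event = trace_queue.get()
        if event is None:  # sentinel: shut down
            break
        recorded.append(event)  # a real worker would ship this to a tracing backend


def handle_request(prompt: str, cache: dict, call_model) -> str:
    """Critical path: cache lookup, model call, then a non-blocking trace enqueue."""
    if prompt in cache:
        response, source = cache[prompt], "cache"
    else:
        response, source = call_model(prompt), "model"
        cache[prompt] = response
    trace_queue.put_nowait({"prompt": prompt, "source": source})
    return response


worker = threading.Thread(target=trace_worker, daemon=True)
worker.start()

cache: dict = {}
echo_model = lambda p: p.upper()  # hypothetical stand-in for a provider call
handle_request("hello", cache, echo_model)
handle_request("hello", cache, echo_model)

trace_queue.put(None)  # stop the worker once the demo requests are done
worker.join()
print(recorded)
```

If the tracing backend is slow or down, requests still complete; the trade-off is that traces can be delayed or lost, which is usually acceptable for telemetry.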

Going deeper

As your stack matures, a few more concerns emerge that the basic reference architecture does not address.

Model routing and cost tiers

Not all requests need the same model. A well-designed gateway can route by intent classification: send simple factual lookups to a cheap small model (e.g., GPT-4o mini), escalate complex reasoning to a frontier model only when needed. This tiered routing can cut per-request costs by 60–80% on mixed-complexity workloads without a visible quality change to users.
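Tiered routing reduces to a classifier plus a lookup table. In the sketch below the keyword-based `classify` function, the model names, and the per-request costs are all hypothetical; production routers typically use a small classifier model rather than keyword matching, and the savings figure depends entirely on the assumed prices and traffic mix.

```python
# Hypothetical model names and per-request costs, for illustration only.
TIERS = {
    "small": {"model": "small-model", "cost_usd": 0.0004},
    "frontier": {"model": "frontier-model", "cost_usd": 0.0100},
}

REASONING_HINTS = ("why", "compare", "explain", "plan", "step by step")


def classify(prompt: str) -> str:
    """Toy intent classifier: long prompts or reasoning cues go to the frontier tier."""
    lowered = prompt.lower()
    if len(prompt.split()) > 40 or any(hint in lowered for hint in REASONING_HINTS):
        return "frontier"
    return "small"


def route(prompt: str) -> dict:
    """Pick the tier configuration for a prompt."""
    return TIERS[classify(prompt)]


prompts = [
    "What are your opening hours?",
    "Where is my order?",
    "What is the return window?",
    "Compare the Pro and Team plans and explain which fits a 5-person startup.",
]
routed_cost = sum(route(p)["cost_usd"] for p in prompts)
frontier_only_cost = len(prompts) * TIERS["frontier"]["cost_usd"]
savings = 1 - routed_cost / frontier_only_cost
print(f"savings vs. frontier-only: {savings:.0%}")
```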

Agent loops and multi-step tracing

When your LLM feature becomes an agent — calling tools, spawning sub-agents, running loops — the flat request/response model breaks down. You need traces that capture the whole tree: which tool calls were made, how long each step took, where costs accumulated. OpenTelemetry-compatible tracing (which Langfuse, Phoenix, and LangSmith all support) is the right foundation here.
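The tree shape of an agent trace can be modeled with a small recursive structure. This is a plain data-structure sketch, not the OpenTelemetry SDK: the `Span` class, its fields, and the sample run are assumptions meant only to show how cost and timing roll up across nested steps.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    cost_usd: float = 0.0
    children: list["Span"] = field(default_factory=list)

    def total_cost(self) -> float:
        """Cost of this span plus all descendants."""
        return self.cost_usd + sum(child.total_cost() for child in self.children)

    def walk(self, depth: int = 0):
        """Yield (depth, span) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


trace = Span("agent_run", 4200, children=[
    Span("plan (LLM call)", 900, cost_usd=0.004),
    Span("search_tool", 1200),
    Span("sub_agent", 1800, children=[
        Span("summarize (LLM call)", 1500, cost_usd=0.010),
    ]),
])

for depth, span in trace.walk():
    print("  " * depth + f"{span.name}: {span.duration_ms:.0f} ms, ${span.cost_usd:.3f}")
print(f"total cost: ${trace.total_cost():.3f}")
```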

Prompt versioning and rollback

Prompts are code. They should live in version control, have a deployment process, and be rollback-able. A prompt change that degrades quality can be harder to detect than a code bug — it passes all type-checks and unit tests — so pairing prompt deployments with an eval run before rollout is the production-grade pattern.
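The deploy-behind-an-eval-gate pattern, with rollback, can be sketched as an in-memory registry. `PromptRegistry` and `toy_score` are hypothetical; in practice the versions would live in git or a prompt-management tool, and the score would come from a real eval suite.

```python
from typing import Callable


class PromptRegistry:
    """In-memory stand-in for prompts stored in version control (hypothetical API)."""

    def __init__(self) -> None:
        self.versions: list[str] = []
        self.active: int = -1  # index into self.versions

    def deploy(self, prompt: str, eval_score: Callable[[str], float],
               min_score: float) -> bool:
        """Activate the new prompt only if its eval score clears the bar."""
        if eval_score(prompt) < min_score:
            return False
        self.versions.append(prompt)
        self.active = len(self.versions) - 1
        return True

    def rollback(self) -> None:
        """Re-activate the previous version, if there is one."""
        if self.active > 0:
            self.active -= 1

    def current(self) -> str:
        return self.versions[self.active]


def toy_score(prompt: str) -> float:
    """Hypothetical eval: reward prompts that ask for citations."""
    return 0.9 if "cite sources" in prompt else 0.5


registry = PromptRegistry()
registry.deploy("Answer briefly and cite sources.", toy_score, min_score=0.8)
rejected = not registry.deploy("Answer briefly.", toy_score, min_score=0.8)
registry.deploy("Answer in detail and cite sources.", toy_score, min_score=0.8)
registry.rollback()
print(registry.current())
```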

Self-hosted vs. managed models

For high-volume or privacy-sensitive workloads, self-hosting an open model (via vLLM, Ollama, or TGI on your own GPUs or a cloud GPU provider) can reduce cost and eliminate provider dependency. The vLLM Production Stack project provides a Kubernetes-native reference implementation with a request router that maximizes KV cache reuse across a cluster of model replicas. Self-hosting moves the operational burden of inference from the provider to your own team, and it replaces per-token API fees with a largely fixed GPU cost, so it pays off only once volume is high enough to keep those GPUs busy.

Continuous evals in CI/CD

The gold standard for production LLM quality is running your eval suite automatically on every pull request that changes a prompt, retrieval configuration, or model version. This surfaces regressions before deployment, the same way automated tests do for regular code. The eval suite grows over time: every production bug that slips through becomes a new test case.

FAQ

Do I need all these layers for a simple LLM chatbot?

No. A simple chatbot calling a single provider directly with basic logging is a legitimate production setup for low-traffic or internal tools. The full stack is the destination as you scale, not the starting point. Add layers when the specific pain they solve (cost, reliability, safety, quality) becomes real for your use case.

What is the difference between a gateway and an observability tool?

A gateway sits on the critical path of every request and enforces routing, rate limits, caching, and failover in real time. An observability tool records what happened and lets you analyze it later. Some gateways also log requests, which gives them a basic observability role, but deep tracing of multi-step agent calls usually requires a dedicated observability layer.

Does adding a gateway increase latency?

Purpose-built LLM gateways typically add 2–20ms of overhead, which is negligible against LLM response times of 500ms to several seconds. Cache hits actually reduce overall latency dramatically because they skip the LLM call entirely — a semantic cache hit returns in 3–8ms.

What is semantic caching and how is it different from regular caching?

Regular caching requires an exact match on the request key. Semantic caching uses vector similarity to find functionally equivalent past prompts — so 'What is the capital of France?' and 'Tell me the capital city of France' both hit the same cache entry. It is built on a vector index and uses a configurable cosine-similarity threshold to decide what counts as a match.

What are LLM guardrails and are they the same as provider content filters?

No. Provider content filters (e.g., the content filtering built into Azure OpenAI) are coarse-grained and applied by the provider before your app receives a response. Application-level guardrails are more granular: they enforce your policies, strip PII before the prompt reaches the model, detect prompt injection attacks, validate that outputs match an expected schema, and can block specific topics or formats. You need both for a comprehensive safety layer.

When should I start running evals?

The best time is before you make your first model or prompt change. Even a small eval set of 20–50 hand-labeled examples lets you detect regressions early. The practical trigger is: the moment you are considering changing the model, rewriting a prompt, or modifying retrieval logic, create the eval set first so you can measure the impact of the change.

Further reading