AI/TLDR

What Is the Modern AI App Stack? The Pieces of an LLM Application

Understand every layer of an LLM application and which ones you can skip on day one.

BEGINNER11 MIN READUPDATED 2026-06-11

In plain English

An "AI app" sounds like one thing. It isn't. Almost every product you've used that talks to a large language model — a support chatbot, a "chat with your PDF" tool, a coding assistant — is actually a small pile of cooperating parts. The model is the engine. The rest of the stack is the car you build around it.

The AI app stack is just the collection of layers that turn a raw model into a working product: your own code, the model API you call, the glue that wires everything together, a place to store knowledge, and the dashboards that tell you whether it's working. None of it is magic. Most of it is ordinary software engineering with one unusual component bolted in the middle.

Picture a restaurant. The chef (the model) is brilliant but can't run the place alone. There's a dining room where orders come in (your app), a waiter who carries requests back and forth (the API), a kitchen pass that coordinates who does what (orchestration), a pantry stocked with ingredients the chef didn't memorize (your data), and a manager watching the floor to catch problems (observability). Take any one away and dinner still happens — just worse. The art of building AI apps is knowing which roles you actually need to hire on day one.

Why it matters

The single biggest mistake beginners make is thinking the model is the product. They pick "the best model," wire up a chat box, ship it, and then watch it hallucinate facts, leak prompts, cost a fortune, and break silently with no idea why. The model was never the hard part. The stack around it is.

Knowing the layers matters for three concrete reasons:

  • You stop reinventing solved problems. Need the model to answer from your documents? That's RAG and a vector store, not a clever prompt. Need it to take actions? That's tool use. Each problem has a known layer, and reaching for the right one saves weeks.
  • You can reason about cost, speed, and risk. Every layer adds latency and dollars. When a request is slow or expensive, you need a mental model of where the time and money go — the API call, the retrieval step, the reranking, the second model call — to fix the right thing.
  • You know what to skip. A weekend prototype needs maybe two layers. A product serving real users needs most of them. Treating those as the same project is how people burn out building infrastructure nobody asked for.

What did this replace? Five years ago, "adding AI" meant collecting a dataset, training a model, and standing up GPU servers — a months-long machine-learning project only big teams could attempt. Hosted model APIs collapsed that to a single HTTP call. The center of gravity moved from training models to assembling stacks around them. That shift is exactly why the role of the AI engineer exists at all.

How it works

Think of the stack as horizontal layers. A user request enters at the top, falls down through each layer to the model, and the answer climbs back up. Here is the full picture — most apps use a subset, not all of it.

1. The model API — the engine

At the core is a hosted model you reach through an LLM API: you send messages, it sends back generated text. Providers like Anthropic (Claude), OpenAI, and Google offer these, and you can also run open models locally when you need privacy or control. This layer is where prompts, function calling, and streaming live. Day one, this might be your entire stack.

2. Orchestration — the kitchen pass

Real apps rarely make one clean model call. They assemble a prompt, maybe retrieve some context, call the model, run a tool the model asked for, feed the result back, and loop until done. That coordination is orchestration. You can hand-write it, or lean on an agent framework like LangChain, LlamaIndex, or a provider SDK to handle the plumbing.

3. Data & retrieval — the pantry

The model only knows what it was trained on. To answer from your documents, you store them as embeddings in a vector database and retrieve the relevant bits at question time. This is the RAG layer, and it's how most products give a model knowledge it never trained on.

4. The ops layer — the manager

Once people depend on your app, you need to see what it's doing and stop it doing dumb things: caching to cut cost, observability to trace every call, guardrails to block bad outputs, and evals to know if changes help or hurt. This is the whole discipline of LLMOps.

Here's a single request flowing through a full-featured stack — a question about your private docs answered by a model:

The layers in detail

Each layer is a category of tools, not a single product. You mix and match. This table is your shopping list — what each layer does, a few real names, and whether a small project needs it.

LayerJobReal tools / approachesNeed it on day one?
Model APIGenerate text, call functionsClaude, GPT, Gemini, local Llama/MistralYes — this is the core
OrchestrationWire prompts, tools, loopsLangChain, LlamaIndex, provider SDKs, plain codeOften — even a little
Vector storeStore + search your datapgvector, Pinecone, Qdrant, Weaviate, Chroma, FAISSOnly if you do RAG
EmbeddingsTurn text into vectorsVoyage, OpenAI, Cohere, open modelsOnly if you do RAG
CachingAvoid repeat workPrompt caching, semantic cache, RedisNo — add when cost bites
ObservabilitySee every call + costLangSmith, Langfuse, Helicone, ArizeNo — add before real users
GuardrailsBlock bad in/outInput/output checks, moderation, validatorsNo — add as risk grows
EvalsMeasure qualityCustom test sets, LLM-as-judgeYes-ish — even a tiny set

Build a minimal stack in code

Talk is cheap. Here is a stack with exactly two layers — your code and a model API — that already does something useful. No framework, no database, no ops tooling. This is a complete, runnable AI app.

minimal_app.pypython
from anthropic import Anthropic

# Layer 1: your code. Layer 2: the model API. That's the whole stack.
client = Anthropic(api_key="sk-ant-...")

def ask(question: str) -> str:
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=400,
        system="You are a concise assistant. Answer in 2-3 sentences.",
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text

print(ask("Explain the AI app stack to a beginner."))

That's it — a real product foundation in a dozen lines. Now watch what happens when you add the next layer up: retrieval, so the model can answer from documents it never trained on. This is the most common third layer people add.

add_retrieval.pypython
# Same model API as before — we just add a tiny data layer in front of it.
docs = [
    "Our refund window is 30 days for physical goods.",
    "Support hours are 9am-6pm Eastern, Mon-Fri.",
]

def retrieve(question: str) -> str:
    # A real app uses embeddings + a vector store here. For a demo,
    # a keyword match shows the *shape* of the retrieval layer.
    hits = [d for d in docs if any(w in d.lower() for w in question.lower().split())]
    return "\n".join(hits) or "(no relevant docs found)"

def ask_grounded(question: str) -> str:
    context = retrieve(question)              # data layer
    prompt = (                                # orchestration: assemble the prompt
        f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"
    )
    return ask(prompt)                        # reuse the model API from before

print(ask_grounded("What is the refund window?"))

Two stack shapes: pipelines vs agents

Stacks come in two broad shapes, and which one you build changes everything downstream. The difference is who's in control of the flow: your code, or the model.

A pipeline is fixed plumbing: retrieve, then prompt, then return. You decide the order. Most production apps are pipelines because they're cheap, fast, and predictable. An agent hands the steering wheel to the model — it decides which tools to call and when it's done, in a loop. Agents unlock harder tasks but cost more and fail in stranger ways.

Going deeper

Once the basic stack clicks, the interesting questions are about the seams between layers — where cost, latency, and reliability actually live. A few directions worth knowing as you move toward production.

Routing and model tiers. You don't have to use one model for everything. A common production pattern is a router: send easy requests to a small, cheap, fast model and escalate only the hard ones to a frontier model. This single decision often cuts the bill by more than half. The router itself can be a rules check, a classifier, or a tiny model call — see cost and latency optimization.

Caching has two flavors. Prompt caching lets a provider reuse the processing of a long, repeated prefix (like a big system prompt) across calls, cutting cost and latency. Semantic caching goes further: if a new question is similar enough to one you've already answered, you skip the model entirely and return the cached answer. Both live in the ops layer and both pay for themselves fast at scale.

The Model Context Protocol (MCP). As stacks grow, wiring each tool and data source into each app by hand gets painful. MCP is an emerging open standard for plugging tools, files, and services into models through a common interface — think of it as a universal adapter for the orchestration and data layers. It's reshaping how the middle of the stack gets assembled.

Evaluation is the layer everyone skips and regrets. "It worked when I tried it" is not a test. Production stacks build a set of example inputs with known-good outputs and re-run them on every prompt or model change, often scoring with an LLM-as-judge. Without evals, every change is a blind guess and quietly drifting quality goes unnoticed until users complain.

Security cuts across every layer. Anything the model reads — a retrieved document, a tool result, a web page — is untrusted input that can carry prompt injection: hidden instructions aimed at hijacking your app. The more layers and tools you add, the larger the attack surface. Treat retrieved and tool-returned text as data, never as commands, and validate model outputs before acting on them.

The durable lesson: the stack is a set of dials, not a fixed recipe. The best AI engineers add the fewest layers that solve the real problem, measure relentlessly, and keep the flow as simple as the task allows. A good AI product UX usually reflects a stack that resisted the temptation to over-build.

FAQ

What are the parts of an LLM application?

At minimum, your own code and a model API. Add layers as needed: orchestration to wire prompts and tools, a vector store and embeddings for retrieval (RAG), and an ops layer of caching, observability, guardrails, and evals once real users depend on it. Most apps use a subset, not all of them.

Do I need a vector database to build an AI app?

No. You only need a vector database if you're doing retrieval — having the model answer from your own documents (RAG). Plenty of useful AI apps are just a model API plus a good prompt. Add a vector store like pgvector, Pinecone, or Qdrant only when you've confirmed the model needs knowledge it wasn't trained on.

What is orchestration in an AI app stack?

Orchestration is the coordination layer that wires everything together: assembling the prompt, retrieving context, calling the model, running any tools it asks for, and looping until the task is done. You can hand-write it in plain code or use a framework like LangChain, LlamaIndex, or a provider SDK to handle the plumbing.

What's the difference between a pipeline and an agent?

In a pipeline (or workflow), you hard-code the sequence of steps, so the path is the same every time — predictable, cheap, and easy to debug. In an agent, the model decides which steps and tools to use in a loop, which handles harder tasks but costs more and fails in stranger ways. Start with a pipeline and only go agentic when you must.

What's the simplest AI app stack to start with?

Two layers: your code and a hosted model API like Claude, GPT, or Gemini. That alone — a thoughtful prompt sent over an API — is a complete, useful app in a dozen lines of code. Add retrieval, caching, and observability later, one at a time, only when a concrete problem demands each one.

Which layer of the stack should I focus on as a beginner?

Start with the model API and prompting, since that's the core of every app. Once you've shipped something basic, the highest-leverage next layers are retrieval (RAG) if you need the model to know your data, and a tiny set of evals so you can tell whether changes help or hurt. Skip caching, guardrails, and heavy ops tooling until you have real users.

Further reading