AI/TLDR

What Is Context Engineering? Managing What the Model Sees

Understand why the field moved from wordsmithing prompts to curating everything in the context window, and the core moves of context engineering.

BEGINNER9 MIN READUPDATED 2026-06-11

In plain English

Every time you call an LLM, it wakes up with no memory. It doesn't remember your last request, your codebase, your customers, or the conversation from five minutes ago. The only thing it knows about your problem is the text inside this one request — a bounded space called the context window. If something isn't in the window, for the model it does not exist.

Picture a brilliant consultant with total amnesia. Every meeting starts from zero, and the only thing they get is the briefing folder you hand them at the door. Pack that folder with the right three pages and they're spectacular. Hand them a 400-page dump of unsorted emails and they'll skim, miss the key memo on page 212, and confidently improvise. The consultant's talent never changed — the folder did.

Context engineering is the discipline of packing that folder. It means deciding what goes into the context window — instructions, examples, documents, chat history, tool outputs — in what shape, in what order, and just as importantly, what gets cut. Where classic prompt engineering asks "how should I phrase the instruction?", context engineering asks the bigger question: "what should the model be looking at while it works?"

Why it matters

In a demo, the prompt is one clever sentence. In a real product, the instruction you wrote by hand might be 5% of the tokens the model receives. The other 95% is conversation history, retrieved documents, tool definitions, file contents, and the results of earlier tool calls — all of it assembled by your code, not typed by a human. Context engineering is quality control for that assembly line.

It matters because most "the model is being dumb" moments are actually context failures. The fact it needed wasn't in the window. Or it was there, buried under 80,000 tokens of irrelevant logs. Or two versions of the same document contradicted each other and the model picked the stale one. You can rewrite the instruction forever — none of those problems live in the instruction.

Huge context windows didn't make this go away. Modern models accept hundreds of thousands of tokens, but their ability to use what's in the window degrades as it fills — a measurable effect engineers call context rot, with a famous special case: facts buried in the middle of a long context are recalled worse than facts near the start or end ("lost in the middle"). On top of that, every token costs money and latency on every single call. A window is a budget, not a dumpster.

Who should care:

  • Anyone building a chatbot. History grows every turn, and how you trim or summarize it decides whether turn 40 is still coherent.
  • Anyone doing retrieval (RAG). Retrieval is context engineering: choosing which documents earn one of the few slots in the window.
  • Anyone building agents. Agents generate their own context — tool results, file reads, search output — at a furious rate, and left unmanaged it drowns the original task.

What did it replace? The magic-words era. Early prompt lore obsessed over incantations — the perfect persona, the lucky phrasing, tipping the model imaginary money. As models improved, those tricks stopped mattering, and the durable wins turned out to come from what information reaches the model, not which adjectives you used to ask for it.

How it works

Start with what's actually in the window on a real request. From the model's point of view it's all one continuous stream of tokens, but it helps to see the layers your code stacks up:

Only the system prompt and maybe the examples are hand-written. Everything else is assembled per request by code you control, and that code has four basic moves:

  • Select. Decide which candidates earn a slot: which 3 of 200 documents, which past turns, which tools. Retrieval, ranking, filtering, and deduplication all live here.
  • Compress. Make what's included smaller without losing the signal: summarize old chat turns, strip boilerplate from documents, cut a giant tool output down to the relevant lines.
  • Order and shape. Put standing instructions where the model weighs them reliably (the start), the live question where it's freshest (the end), and fence documents with clear delimiters so sources don't bleed into each other.
  • Isolate. Keep dangerous mixtures apart: one job per window where possible, untrusted text clearly separated from instructions, token-heavy side quests pushed into separate calls.

In a running system those moves form a pipeline that executes before every model call:

Every step trades information for attention. The core insight is that the window is not free real estate — it's the model's working memory, and like yours, it works best when the desk is clean.

Context engineering vs prompt engineering

These aren't rival camps — one contains the other. Prompt engineering is about the sentences you write. Context engineering is about the entire payload the model receives, of which your sentences are one part.

A useful rule of thumb: if you're typing into a chat box, prompt wording is most of the game. The moment you're building a system — a bot with memory, a retrieval pipeline, an agent — the leverage shifts. A mediocre instruction over exactly the right five documents beats a perfect instruction over fifty wrong ones, every time.

A token budget in code

The simplest context-engineering artifact you can build today is a token budget: a hard cap on input size, a priority order for what gets in, and a fixed layout for assembly. Here's the whole idea in plain Python, no frameworks:

context_budget.pypython
# Fit the most valuable pieces into a fixed input budget.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough: ~4 chars per English token

BUDGET = 8_000  # input tokens; leave headroom for the answer

# (priority, label, text) — lower number = more important.
# Swap the placeholder strings for your retriever / history store.
candidates = [
    (0, "system",   "You are a support agent. Answer only from the context."),
    (0, "question", "How do I reset a customer's 2FA?"),
    (1, "docs",     "...top 3 retrieved help-center articles..."),
    (2, "history",  "...summary of older turns + last 4 messages verbatim..."),
    (3, "extras",   "...retrieved articles ranked 4-10..."),
]

kept, used = {}, 0
for priority, label, text in sorted(candidates):
    cost = estimate_tokens(text)
    if used + cost > BUDGET:
        continue  # drop whole pieces; never truncate mid-sentence
    kept[label] = text
    used += cost

# Fixed layout: instructions first, the live question last.
layout = ["system", "history", "docs", "extras", "question"]
prompt = "\n\n".join(kept[k] for k in layout if k in kept)
print(f"assembled {used} of {BUDGET} tokens")

Three habits in that snippet carry straight into real systems: pieces get dropped whole (a half-truncated document is worse than no document), the layout is fixed (instructions first, question last, every time), and usage is logged so you can reconstruct what the model saw when something goes wrong. In production you'd swap the 4-characters-per-token estimate for a real tokenizer count — most provider SDKs expose one — and the placeholder strings for your retriever and history store.

Common pitfalls

A handful of mistakes account for most context-shaped failures:

  • Stuffing "just in case." Adding everything that might be relevant feels safe and measurably hurts — long, noisy prompts degrade accuracy well before they hit the token limit.
  • Unbounded chat history. Appending every turn forever means turn 50 ships with 49 turns of baggage. Trim or summarize on a schedule, not when things break.
  • Burying the lede. The key fact sits in the middle of 60,000 tokens of filler — exactly where recall is weakest. Put what matters near the start or the end.
  • Contradiction smuggling. Yesterday's retrieved doc says one thing, today's says another, and both are in the window. Dedupe and prefer fresh sources before assembly, because the model won't reliably arbitrate.
  • Flying blind. If you can't reproduce exactly what was in the window for a failed request, you can't fix it. Log the fully assembled prompt, always.

Going deeper

Two physical realities shape advanced context work. The first is position effects: the "Lost in the Middle" research showed that recall across a long context is U-shaped — strong at the edges, weak in the middle — which is why production layouts conventionally pin instructions to the top and the live question to the bottom. The second is caching: major LLM APIs cache the repeated prefix of a prompt, so requests that share an identical opening run cheaper and faster. That turns ordering into an economic decision — stable content (system prompt, tool definitions) goes first, volatile content (retrieved docs, the user message) goes last, and reshuffling components per request silently destroys your cache hits.

Long-running agents add a survival problem: an agent doing real work generates tool results faster than any window can absorb. The standard responses are compaction — when the window nears its limit, summarize the transcript so far and continue with the summary plus the most recent messages — and tool-result clearing, where old outputs are replaced with a short stub once they've served their purpose. Coding agents like Claude Code do this routinely, and some providers now offer it server-side.

Context isolation goes a step further: spawn a sub-agent with its own clean window for a token-heavy side quest ("read these thirty files and report back"), and let it return one distilled paragraph instead of thirty files. Memory inverts the whole model: instead of cramming state into the window, the agent writes notes to files or a store outside it and pulls them back on demand. The window stops being the state and becomes a cache over external state.

There's also a security dimension. Every pipe that feeds your window is an attack surface: a retrieved web page or forwarded email can carry instructions aimed at the model — that's prompt injection — so fencing untrusted text, labeling sources, and deciding what never enters the window are defensive moves as much as quality moves.

The open problems are honest ones. There's no reliable way to know in advance which slice of available information the model will actually need; retrieval evals measure proxies, not outcomes; and every compression step — summarization, compaction, trimming — is lossy in ways that only surface later. The mental model that holds up, echoed in Anthropic's engineering guidance, is that attention is the scarce resource and the window is its budget. Make every token pay rent, and most of the practice follows from there.

FAQ

Is context engineering just a rebrand of prompt engineering?

No — it's a superset. Prompt engineering covers the instructions you write; context engineering covers everything the model receives: instructions, chat history, retrieved documents, tool definitions, and tool results. Wording skills still matter, but they're one component of a larger assembly job.

Is context engineering the same thing as RAG?

RAG is one technique inside it. Retrieval nominates documents as candidates for the window; context engineering also decides how many make the cut, how they're formatted and ordered, how chat history is trimmed, and what gets dropped when the token budget runs out.

Do million-token context windows make context engineering unnecessary?

No. Bigger windows raise the ceiling, but accuracy still degrades as the window fills (context rot), facts in the middle of long contexts are recalled worse, and every extra token adds cost and latency to every call. Curation keeps winning even when space is abundant.

What is in the context window besides my message?

In a production app: the system prompt, tool definitions, few-shot examples, retrieved documents, the conversation so far, and the results of earlier tool calls. Your hand-written message is often the smallest slice of what the model actually reads.

How do I start doing context engineering on an existing app?

Log the fully assembled prompt for every request, then read the logs for your worst failures — you'll almost always find missing facts, buried facts, or junk. Then add a token budget with explicit priorities per component and a fixed layout. Measurement first, budget second.

Further reading