In plain English
A context window is the maximum amount of text a large language model can see and work with at one time. Not per message — in total. Your instructions, the entire conversation so far, any documents you pasted, and the answer the model is currently writing all have to fit inside this one shared space. When it's full, something has to go.
Picture a brilliant expert who has zero memory but an enormous whiteboard. The expert can reason about anything written on the board — but only what's on the board. Every time you ask a question, the entire conversation gets rewritten onto the whiteboard and the expert reads all of it from scratch before answering. The whiteboard has a fixed size. Once it fills up, the oldest notes get erased to make room for new ones. That whiteboard is the context window, and the erasing is why chatbots "forget" things you told them an hour ago.
The window isn't measured in words or pages. It's measured in tokens — the chunks of text models actually read, usually a word or a piece of a word. (If tokens are new to you, read what a token is first — it's a five-minute concept.) A useful rule of thumb for English: one token is about three-quarters of a word, so 1,000 tokens is roughly 750 words.
Sizes vary wildly by model. Early GPT-3-era models could hold about 2,000 tokens — a few pages of text. Modern frontier models advertise windows from around a hundred thousand tokens (a decent-sized novel) up to a million or more (a small codebase, hours of meeting transcripts). Bigger windows changed what's possible, but the core constraint never went away: the window is finite, and everything competes for it.
Why it matters
The context window explains the single most common chatbot complaint: "it forgot what I told it earlier." The model isn't broken and it isn't being lazy. The earlier message either fell out of the window entirely, or the app silently summarized or trimmed it to make room. Once you understand the window, a whole class of confusing behavior suddenly makes sense.
It also decides what a model can do in one shot. Can it review your full contract? Read an entire codebase? Digest a 3-hour meeting transcript? That's purely a question of whether the material fits in the window. If it doesn't, you need workarounds — chunking the input, summarizing as you go, or retrieval systems (RAG) that fetch only the relevant pieces.
Three groups should care, for three different reasons:
- Everyday users — knowing the window exists tells you when to start a fresh chat, why long conversations drift off the rails, and why pasting a huge document sometimes makes answers worse.
- Developers — every token you send costs money and time. The window is a hard engineering budget: system prompt, history, retrieved documents, and the reply all draw from the same account.
- Agent builders — tool definitions, schemas, and multi-step histories eat tokens shockingly fast. A long-running agent that never trims its context will hit the wall mid-task.
One more thing the window quietly replaced: the illusion of memory. LLMs are stateless — the model itself remembers nothing between API calls. Chat apps fake continuity by re-sending the whole history with every message. Even "memory" features in modern chatbots are engineering on top: the app saves notes about you and re-inserts them into the context window each time. The window is the only memory the model ever has.
How it works
Under the hood, an LLM is a function: token sequence in, predicted tokens out. There are no separate compartments for "instructions" versus "chat" versus "documents". Everything gets flattened into one long sequence of tokens, and that sequence must fit inside the window. Here's what's typically packed in:
Notice the last layer: the model's output counts too. The window is shared between input and output. If a model has a 128,000-token window and you stuff 127,000 tokens of input into it, the longest possible reply is about 1,000 tokens — no matter what the model's separate max-output setting says. Cram the window full and you get truncated, cut-off answers.
Because the model is stateless, every single turn of a conversation rebuilds the whole sequence from scratch:
Why is the window finite at all? Because of how attention works: when the model generates each new token, it compares that token against every token already in the sequence. More context means more comparisons, more memory, and more compute — the cost grows fast as the sequence gets longer. The window size is the point where the model's architecture and the hardware running it say "enough."
What counts toward the window
- Every message, from every role — system, user, and assistant alike.
- Tool and function definitions — JSON schemas are sneakily token-heavy.
- Images and files in multimodal models — they're converted into tokens too, often hundreds per image.
- Hidden reasoning — "thinking" models burn window space on chains of thought you may never see.
- The model's own previous answers — long replies make the next turn more expensive.
And when the total goes over the limit? Either the API rejects the request outright, or the app silently drops the oldest content. Both failure modes — and how to handle them gracefully — get their own article: what happens when you exceed the context window.
Counting what fits: a hands-on example
You don't have to guess how full your window is — you can count. OpenAI's tiktoken library tokenizes text exactly the way many models do, so you can measure your prompt before sending it. Here's a minimal token-budget check, including the trimming logic chat apps use when history outgrows the window:
import tiktoken
# cl100k_base is the encoding used by many OpenAI chat models
enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_WINDOW = 128_000 # the model's hard limit (input + output)
MAX_OUTPUT = 4_096 # space we reserve for the reply
system_prompt = "You are a support agent for Acme. Be concise and polite."
history = [
"User: My order #4821 never arrived.",
"Assistant: Sorry to hear that! Let me look into order #4821.",
]
new_message = "User: It's been two weeks. I want a refund."
def count(text: str) -> int:
return len(enc.encode(text))
used = count(system_prompt) + sum(count(m) for m in history) + count(new_message)
budget = CONTEXT_WINDOW - MAX_OUTPUT
print(f"Tokens used: {used} / {budget}")
if used > budget:
# Drop oldest messages first — this is exactly why chatbots 'forget'
while used > budget and history:
used -= count(history.pop(0))
print(f"Trimmed history. Now at {used} tokens.")Two practical notes. First, real chat APIs add a handful of extra tokens per message for role formatting, so treat counts like this as close estimates, not exact figures. Second, the rule-of-thumb conversions are worth memorizing: ~4 characters per token, ~0.75 English words per token, and roughly 100,000 tokens for a typical novel.
Terminology and common misconceptions
The vocabulary around context windows is messy, and a few persistent myths trip up beginners. Here's the decoder ring:
| Term | What it actually means |
|---|---|
| Context window | Total tokens the model can handle at once — input and output combined |
| Context length | The same thing; the two terms are used interchangeably |
| Max output tokens | A separate, smaller cap on how long a single reply can be |
| Effective context | How much of the window the model uses well — often less than advertised |
| "Memory" | An app feature that saves notes and re-inserts them into the window — not model memory |
Myths worth killing early
- "The model remembers our past chats." It doesn't. The app re-sends history, or injects saved notes. The model itself forgets everything the instant a request ends.
- "A bigger window means a smarter model." No — it means the model can see more, not reason better. A model with a huge window and weak reasoning will confidently mangle a long document.
- "If it fits, the model uses all of it equally." Models pay more attention to the start and end of long contexts than the middle — a measurable effect with its own name and its own article (see Going deeper).
- "The window is measured in words." Tokens. Code, non-English languages, and dense formatting all tokenize less efficiently than plain English prose, so the same character count can cost very different amounts.
Going deeper
The quadratic wall. Standard transformer attention compares every token against every other token, so compute scales roughly with the square of sequence length: doubling the context quadruples the attention work. Long-context models exist because of engineering that attacks this wall — memory-efficient kernels like FlashAttention, sparse and sliding-window attention patterns, and the architectural tricks behind today's million-token models.
The memory bill arrives at inference. Even after the compute problem is tamed, every token in the window leaves a residue: the model caches intermediate attention values (keys and values) for the whole sequence so it doesn't recompute them for each new token. This KV cache grows linearly with context length and can consume more GPU memory than the model weights themselves on long inputs. It's the real reason serving long contexts is expensive, and the reason providers price long prompts the way they do.
Advertised vs effective context. A model that scores perfectly on "needle in a haystack" retrieval — find one planted sentence in a sea of text — can still fail when a task requires combining facts scattered across a long input. Recall is reliably strongest at the beginning and end of the context and weakest in the middle, a phenomenon documented in the "Lost in the Middle" paper and unpacked in our article on it. Practical consequence: put your most important instructions and facts at the start or end of the prompt, never buried in the middle of a document dump.
Training length ≠ inference length. Models are typically pretrained on shorter sequences than the window they ship with, then extended — by scaling their positional encodings (RoPE-based methods like YaRN) and fine-tuning on long documents. This is why long-context quality varies so much between models with identical advertised windows: the extension recipe matters as much as the number on the spec sheet.
Prompt caching changes the economics. Because the system prompt and tool definitions are identical across requests, providers can cache the computed state of that static prefix and skip reprocessing it. The practical design rule: structure prompts with stable content first and variable content last, so the cacheable prefix is as long as possible. On high-traffic apps this routinely cuts both latency and cost by large factors.
Context engineering is becoming its own discipline. Long-running agents can't just append forever — they trim old turns, summarize completed work, offload facts to files or vector stores, and pull them back via retrieval only when needed. The open problem underneath it all: nobody has fully solved persistent memory. Every current approach — bigger windows, summarization, retrieval, memory features — is a workaround for the same fundamental fact you started this article with: the model only knows what's on the whiteboard.
FAQ
Is context length the same as context window?
Yes — the terms are used interchangeably. Both mean the maximum number of tokens the model can process at once, input and output combined. Watch for one wrinkle: some providers also list a separate "max output tokens" limit, which caps a single reply and is much smaller than the full window.
Does the system prompt count toward the context window?
Yes. Every token sent to the model counts: the system prompt, tool and function definitions, the entire conversation history, attached documents, and the reply being generated. There are no free passes — a 2,000-token system prompt costs 2,000 tokens of window on every single request.
Do the model's output tokens count toward the context window?
Yes. The window is shared between input and output. If the window is 128,000 tokens and your input uses 127,000, the reply can only be about 1,000 tokens long before it gets cut off. That's why apps reserve output headroom instead of filling the window with input.
Why does ChatGPT forget the beginning of a long conversation?
Because the model is stateless and the app re-sends your conversation history with every message. Once the history outgrows the context window, the app trims or summarizes the oldest messages to make room. The model never "saw and forgot" — the early messages simply stopped being sent.
Does a bigger context window make the model smarter?
No. A bigger window lets the model see more at once, but it doesn't improve reasoning. Long-context models also use their windows unevenly — recall is strongest at the beginning and end of the input and weakest in the middle — so a giant window stuffed with irrelevant text can actually produce worse answers than a short, focused prompt.
Do images and files count toward the context window?
Yes. Multimodal models convert images into tokens — often several hundred to over a thousand per image depending on resolution — and those tokens occupy window space exactly like text. Uploaded files are either tokenized directly into the context or chunked and retrieved, but whatever the model reads counts.