In plain English
Every language model has a context window — the total amount of text it can hold in its head at once, measured in tokens. That window has to fit everything: your system prompt, every earlier message in the conversation, any documents you pasted, and the reply the model is about to write. When the sum of all that crosses the limit, you've exceeded the context window.
Picture a whiteboard with a fixed amount of space. You can keep writing notes on it for a while, but the moment it's full you have two choices: stop writing, or erase the oldest notes to make room. A model facing a full context window does roughly one of those two things — and which one depends entirely on the tool or API you're using. There is no single universal behavior, which is exactly why this trips people up.
Why it matters
If you've ever had a long chat with an assistant and watched it suddenly forget an instruction you gave it ten minutes ago, you've already met this problem. The model didn't get lazy — the earliest part of the conversation got pushed out of the window to make room for newer text. This is the single most common cause of "why does it keep forgetting what I told it?"
For anyone building on top of an API, the stakes are higher than a confused chat. Depending on the provider, overflowing the window can throw a hard 400 error that crashes your request, silently drop the beginning of your prompt (including the instructions you most cared about), or cut off the reply halfway through a sentence. If you don't know which of these your stack does, you'll find out from an angry user instead of from your own logs.
- Chat users hit it as creeping amnesia in long sessions.
- App developers hit it as runtime errors or truncated outputs that pass tests but fail in production with real, long inputs.
- RAG and agent builders hit it constantly, because retrieval and tool results pump large chunks of text into the window on every turn.
How it works
The model itself has no clock or counter ticking down. The limit is a hard architectural ceiling: the model was trained to process sequences only up to a certain length. So the handling of an overflow happens in the layer around the model — the API server or the chat product — and that layer picks one of four strategies.
1. Hard error (most raw APIs)
The strictest behavior. If your input alone — plus the room you reserved for the reply — exceeds the limit, the API refuses the whole request before generating anything. OpenAI's API does this: you get an HTTP 400 with code context_length_exceeded and a message that literally tells you how many tokens you sent versus the cap. Nothing is silently dropped; your code has to catch the error and shrink the request.
2. Truncate the input (some wrappers and chat UIs)
Instead of erroring, the layer quietly chops text off the start of the conversation until it fits, then sends it. This is convenient — the request always succeeds — but dangerous, because the dropped text is usually your system prompt or earliest instructions, and the model never tells you they're gone. The reply looks confident and complete while missing half the context.
3. Sliding window / FIFO (most chat products)
A continuous version of truncation. As new turns arrive, the oldest turns are evicted first-in-first-out, so the window is always full of recent text. This is why a long chat "forgets" the beginning: those tokens were rolled off the back. Anthropic's docs describe chat interfaces as optionally using exactly this rolling FIFO behavior.
4. Stop generating mid-reply (modern Claude behavior)
Here the request is accepted even if input + max_tokens could overrun the window. The model writes a normal reply, and if generation actually reaches the ceiling it stops and reports why. On Claude 4.5 models and newer, this surfaces as stop_reason: "model_context_window_exceeded" instead of a crash — a successful HTTP 200 response that happens to be cut short. Your code checks the stop reason rather than a try/except.
Errors vs. stop reasons: a critical distinction
Two very different things can happen, and conflating them is the number-one source of brittle code. An error means the request failed: you get a 4xx/5xx HTTP status and no useful content. A stop reason means the request succeeded but generation ended early — you get a 200, valid (if partial) content, and a field explaining why it stopped.
| Situation | What you get | How to detect it |
|---|---|---|
| Input too big up front (OpenAI-style) | HTTP 400, no content | Catch the API exception; read error code |
| Reply hit the window ceiling (Claude 4.5+) | HTTP 200, partial text | Check stop_reason == "model_context_window_exceeded" |
Reply hit your own max_tokens cap | HTTP 200, partial text | Check stop_reason == "max_tokens" |
| Input silently truncated by a wrapper | HTTP 200, full-looking text | Hard to detect — count tokens yourself |
That last row is the scary one. A silent truncation produces a response that looks totally fine, so your tests pass and your monitoring stays green while users get answers that ignored their instructions. The defense is to count tokens before you send rather than trusting the layer to behave.
- HTTP 4xx / 5xx
- No usable content
- Handle with try / except
- e.g. OpenAI context_length_exceeded
- HTTP 200
- Valid but partial content
- Handle by reading stop_reason
- e.g. model_context_window_exceeded
Handling it in code
The robust pattern has two layers of defense. First, estimate your token count before sending so you never blindly overrun the window. Second, still check the response, because estimates and reserved output room are never perfectly exact. Anthropic exposes a free token-counting endpoint for the pre-flight check, and the SDK returns a stop_reason for the post-flight one.
from anthropic import Anthropic
client = Anthropic()
MODEL = "claude-opus-4-8" # 1M-token window as of mid-2026
WINDOW = 1_000_000
MAX_OUTPUT = 8_000 # room we reserve for the reply
def ask(messages):
# 1) Pre-flight: count input tokens (this call is free, no compute)
counted = client.messages.count_tokens(model=MODEL, messages=messages)
if counted.input_tokens + MAX_OUTPUT > WINDOW:
messages = trim_oldest(messages) # your shrink strategy
# 2) Send, then post-flight: check WHY it stopped
resp = client.messages.create(
model=MODEL, max_tokens=MAX_OUTPUT, messages=messages,
)
if resp.stop_reason == "model_context_window_exceeded":
# 200 OK, but the reply was cut short by the window
warn_user("Reply truncated — conversation is near its limit.")
return resptrim_oldest is where you decide your overflow policy instead of letting the plumbing decide for you. The two workhorse strategies are a sliding window (keep the system prompt plus the last N turns, drop the rest) and summarization (compress old turns into a short recap and prepend it). Most production chat apps use a hybrid: keep the last ~10-15 turns verbatim, and a running summary for everything older.
Current landscape (mid-2026)
Windows have grown enormously, which changes when you hit the wall but not whether you do. As of mid-2026, the headline numbers among the major families look roughly like this — though specifics churn, so confirm against the provider's own model page before you rely on a number.
| Family (mid-2026) | Typical context window | Overflow behavior |
|---|---|---|
| Claude Opus 4.8 / Sonnet 4.6 | 1M tokens | model_context_window_exceeded stop reason (4.5+) |
| Gemini (2.5 / 3.x Pro) | 1M+ tokens | Hard request error when exceeded |
| GPT-5 family | Large (hundreds of K) | context_length_exceeded 400 error |
| Most chat UIs | Same as the API model | Sliding FIFO eviction of old turns |
Two newer wrinkles are worth knowing. First, providers now ship server-side compaction — the API summarizes earlier turns for you automatically, so long sessions can run past the raw limit with little integration work. Anthropic offers this in beta across its current models. Second, some Claude models added context awareness: the model receives a running token budget (for example, a <budget:token_budget>1000000</budget:token_budget> tag) so it can pace a long task instead of blindly running out of room. These features reduce, but don't eliminate, the need to handle overflow yourself.
Going deeper
Why is the limit hard at all? Because position is baked into the model. A transformer's attention mechanism uses positional encodings to know where each token sits in the sequence, and the model was only trained on positions up to its window size. Feed it a position it never saw during training and behavior degrades sharply — which is why providers enforce the ceiling rather than letting you quietly overrun it.
This also explains why simply truncating the middle of a long input is risky. Models attend most strongly to the beginning and the end of their context and weakest to the middle — the well-documented "lost in the middle" effect. So even a request that technically fits can behave as if it overflowed, with crucial buried details effectively ignored. Overflow handling and retrieval quality are two sides of the same coin.
Smarter eviction than plain FIFO
Plain first-in-first-out eviction is crude — it can drop a pinned instruction just because it's old. Better systems treat context as a managed resource: pin the system prompt and any must-keep facts so they're never evicted, score the rest by relevance to the current turn, and evict low-value chunks first. This is the heart of context engineering and overlaps heavily with how agents manage long-running memory across many tool calls.
For genuinely unbounded history — say, an assistant that should recall something you said last week — no window is big enough. The standard escape hatch is to push old context out of the window into an external store (a vector database) and retrieve only the relevant slivers back in per turn. That converts an impossible "remember everything" problem into a tractable "retrieve the right thing" problem — and it's why overflow handling, summarization, and RAG keep showing up together in production stacks.
FAQ
What actually happens when a context window is full?
One of four things, depending on your tool. A raw API may reject the request with a hard error (OpenAI's context_length_exceeded 400). A wrapper or chat UI may silently truncate the oldest text to make it fit. A chat product typically uses a sliding window that evicts the oldest turns first. Or a modern model (Claude 4.5+) accepts the request and stops mid-reply with stop_reason: "model_context_window_exceeded". It is a property of your plumbing, not the model itself.
Why does ChatGPT forget what I said earlier in a long conversation?
Because chat products use a sliding window. As you add new messages, the oldest ones are pushed out first-in-first-out to keep the window full of recent text. Once your earliest instructions roll off the back, the model literally cannot see them anymore, so it behaves as if it forgot. It is not a memory bug — it is the window doing exactly what it is designed to do.
Will I get an error or just a truncated response if I exceed the limit?
It depends on the provider. OpenAI throws a hard 400 error and generates nothing, so you must catch the exception. Claude 4.5 and newer models instead accept the request and return a normal HTTP 200 with a partial reply and stop_reason: "model_context_window_exceeded", so you check the stop reason rather than catching an error. Some wrappers do neither and silently drop input — the most dangerous case, because the response looks complete.
How do I prevent context window overflow before it happens?
Count your tokens before sending. Anthropic offers a free token-counting endpoint, and OpenAI-compatible stacks use libraries like tiktoken. If input plus your reserved output room exceeds the window, shrink the request first — keep the system prompt and the last N turns, summarize older turns, or retrieve only relevant chunks instead of dumping everything in. Then still check the response's stop reason as a backstop.
Does a bigger context window fix the forgetting problem?
It pushes the wall further out but does not remove it, and it introduces a new one. As you fill even a million-token window, accuracy and recall degrade — an effect called context rot — and models pay less attention to the middle of long inputs. A long conversation will still eventually overflow, and a stuffed-but-not-overflowing window can already behave as if details are missing. Curating what goes in matters more than raw size.