AI/TLDR

How to Manage Chat History: Trimming, Summarizing, and Sliding Windows

Compare the three standard strategies for keeping long conversations inside the context window, and learn which one fits your app.

INTERMEDIATE · 11 MIN READ · UPDATED 2026-06-11

In plain English

Here's the secret every chat app is built on: the model remembers nothing. Each API call is a blank slate. When ChatGPT or Claude appears to "remember" that you mentioned your dog three messages ago, it's because the app silently re-sent the entire conversation — every message from both sides — along with your new one. The model reads the whole transcript from scratch, every single turn.

Imagine talking to a brilliant consultant with total amnesia. Before every reply, an assistant hands them a printed transcript of the conversation so far. Early on this works great. But the transcript grows with every exchange, and the consultant can only read so many pages before your meeting time runs out. Eventually you have three choices: throw away the oldest pages (trimming), only ever carry the most recent few pages (sliding window), or have someone rewrite the old pages as a one-page recap and staple it to the recent pages (summarization).

That's chat history management in a nutshell. It's one of the core jobs of context engineering: deciding which parts of a growing conversation actually make it into the prompt, and what gets cut or compressed. Every production chat product — customer support bots, coding assistants, AI companions — runs one of these three strategies, or a hybrid of them.

Why it matters

An unmanaged conversation fails in three escalating ways, and each one bites a different part of your app:

  • Hard failure. Context windows are finite. Once the transcript plus the model's reply exceeds the limit, the API either rejects the request or cuts the response short. Your chatbot doesn't degrade gracefully — it errors out, usually on your most engaged users, because they're the ones with the longest conversations.
  • Cost creep. You pay per input token, and the whole history is input on every turn. A conversation that's twice as long costs roughly twice as much per message, so total cost grows quadratically with conversation length. Turn 50 of a chat can cost ten times what turn 5 did, for the same one-line question.
  • Quality decay. Even when everything fits, stuffing the window hurts. Models get measurably worse at recalling and using details as context grows — the context rot problem covered in prompt length vs quality. A 100-turn transcript full of small talk actively buries the three facts that matter.
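The cost-creep arithmetic can be made concrete with a small sketch. It assumes every turn adds the same number of tokens (a hypothetical 200) and ignores the system prompt and output pricing. The function names are illustrative, not from any SDK.

```python
def input_tokens_at_turn(turn: int, tokens_per_turn: int = 200) -> int:
    """Input size of request number `turn` when the whole history is re-sent.

    Assumption: each turn adds a fixed number of tokens. Real conversations
    vary, but the growth pattern is the same.
    """
    return turn * tokens_per_turn


def cumulative_input_tokens(turns: int, tokens_per_turn: int = 200) -> int:
    """Total input tokens billed across the first `turns` requests."""
    return sum(input_tokens_at_turn(t, tokens_per_turn)
               for t in range(1, turns + 1))


if __name__ == "__main__":
    for n in (5, 50, 100):
        print(n, input_tokens_at_turn(n), cumulative_input_tokens(n))
```

Under these assumptions, turn 50 sends ten times the input of turn 5 (10,000 versus 1,000 tokens), and doubling the conversation from 50 to 100 turns roughly quadruples the cumulative input (255,000 versus 1,010,000 tokens).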

Who should care: anyone shipping a multi-turn experience. A single-shot tool ("summarize this email") never hits this problem. A support bot, tutoring app, agent, or anything users talk to for more than a few minutes hits it fast. And there's no built-in fallback to inherit — a naive implementation just appends messages to a list until something breaks. History management is the difference between a demo and a product.

How it works

Every chat app runs the same loop. Your code keeps an array of messages: a system prompt followed by alternating user and assistant turns. Each new user message is appended, a history policy decides what subset of the array gets sent, the model replies, and the reply is appended too. The policy step is where all three strategies live.
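As a rough illustration of that loop, here is a minimal sketch. The names chat_turn and send_everything are hypothetical, and call_model stands in for whatever provider SDK you use. The point is only that the policy produces a per-request view while the stored history stays complete.

```python
from typing import Callable

# Assumed message shape: {"role": "system" | "user" | "assistant", "content": str}
Policy = Callable[[list], list]


def send_everything(messages: list) -> list:
    """Placeholder policy: no trimming at all (the naive baseline)."""
    return list(messages)


def chat_turn(history: list, user_text: str,
              policy: Policy, call_model: Callable[[list], str]) -> str:
    """Run one turn: append, apply the history policy, call the model, append."""
    history.append({"role": "user", "content": user_text})
    view = policy(history)        # the subset the model sees this turn
    reply = call_model(view)      # provider-specific call, stubbed by the caller
    history.append({"role": "assistant", "content": reply})
    return reply


if __name__ == "__main__":
    def fake_model(view: list) -> str:
        return f"(saw {len(view)} messages)"

    history = [{"role": "system", "content": "You are a helpful assistant."}]
    print(chat_turn(history, "Hi, my dog is called Rex.", send_everything, fake_model))
    print(chat_turn(history, "What is my dog's name?", send_everything, fake_model))
```

Swapping send_everything for a trimming, windowing, or summarizing function changes what the model sees without touching the stored history.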

Strategy 1: Trimming

Set a token budget (say, 50K tokens). Before each call, count the tokens in your message array; while it's over budget, drop the oldest user/assistant pair. The system prompt is always exempt — it carries your instructions and must survive every cut. Trimming is reactive: short conversations are sent untouched, and cutting only starts when you'd otherwise overflow. It's the cheapest strategy (zero extra LLM calls) and the most lossy — dropped messages are simply gone, and the model will confidently deny that the early conversation ever happened.

Strategy 2: Sliding window

Always send only the last N turns (or last K tokens), no matter how long the conversation gets. It's trimming with a fixed-size view instead of a triggered one. The win is predictability: every request has roughly the same size, cost, and latency, which makes capacity planning trivial. The loss is the same as trimming — anything that slides out of the window is forgotten — but it happens earlier and constantly, not just near the limit.
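A sliding window over turns might look like the sketch below. It assumes messages[0] is the system prompt and counts a turn as a user message plus whatever follows it until the next user message. The function name and the default of five turns are illustrative choices.

```python
def sliding_window(messages: list, max_turns: int = 5) -> list:
    """Keep the system prompt plus the most recent `max_turns` user turns.

    A turn starts at a user message and includes every reply after it,
    so the window never opens on an orphaned assistant message.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    system, rest = messages[0], messages[1:]
    # Positions in `rest` where a user turn begins.
    user_starts = [i for i, m in enumerate(rest) if m["role"] == "user"]
    if len(user_starts) <= max_turns:
        return [system] + rest  # conversation is still shorter than the window
    return [system] + rest[user_starts[-max_turns]:]
```

Because the cut happens on every request once the conversation outgrows the window, request size stays roughly constant, which is the predictability described above.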

Strategy 3: Summarization

When the history gets long, make a separate LLM call: "summarize this conversation, keeping names, decisions, constraints, and open questions." Replace the old messages with that summary (usually injected as a single message near the top), and keep the recent turns verbatim. Now the model still "knows" the user's name from turn 2, even at turn 80. The trade-offs: every compaction costs an extra LLM call and a latency spike, and summaries are lossy in a sneakier way — the summarizer decides what was important, and it will sometimes guess wrong.

In practice, production systems converge on a hybrid: a running summary of everything old, plus the last 5–10 turns verbatim. Recent context stays crisp (exact wording matters for follow-ups like "change that to blue"), while the long tail is compressed instead of deleted. This is exactly the pattern frameworks like LangChain ship as built-in summarization memory, and what chat products do under labels like "compacting conversation."

Choosing a strategy

There is no universally correct answer — the right policy depends on whether old details ever matter again in your app. A quick decision table:

| | Trimming | Sliding window | Summarization |
|---|---|---|---|
| Memory of old turns | None once dropped | None outside window | Gist preserved, details lossy |
| Extra LLM calls | Zero | Zero | One per compaction |
| Cost per turn | Capped at budget | Flat and predictable | Lowest at high turn counts |
| Implementation effort | ~20 lines | ~5 lines | Prompt + edge cases |
| Failure mode | "What dog?" amnesia | Same, but constant | Summary silently drops a constraint |
| Best for | Tools, short tasks | High-volume, stateless-ish bots | Companions, support, agents |

Hands-on: trim and summarize in Python

Both strategies fit in a small file with no framework. The only provider-specific part is the actual model call, stubbed here as call_llm — plug in whichever SDK you use. Note the two non-obvious details: the system prompt is never dropped, and trimming always removes whole user/assistant pairs so the history never starts with an orphaned assistant reply.

chat_history.py (Python)
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Swap in your provider's tokenizer for exact counts.
    return len(text) // 4


def message_tokens(msg: dict) -> int:
    return estimate_tokens(msg["content"]) + 4  # small per-message overhead


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Drop oldest turns until the history fits the token budget.
    messages[0] is the system prompt and is never dropped."""
    system, rest = messages[0], list(messages[1:])
    total = message_tokens(system) + sum(message_tokens(m) for m in rest)
    while rest and total > budget:
        total -= message_tokens(rest.pop(0))
        # Never leave an orphaned assistant reply at the front.
        if rest and rest[0]["role"] == "assistant":
            total -= message_tokens(rest.pop(0))
    return [system] + rest


def summarize_history(messages: list[dict], call_llm,
                      keep_recent: int = 6,
                      budget: int = 8000) -> list[dict]:
    """Fold older turns into a recap; keep recent turns verbatim."""
    system, rest = messages[0], list(messages[1:])
    if sum(message_tokens(m) for m in rest) <= budget:
        return messages  # under budget: send as-is

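    # Split off the verbatim tail. An explicit index avoids rest[:-0],
    # which would be empty when keep_recent is 0.
    split = max(len(rest) - keep_recent, 0)
    old, recent = rest[:split], rest[split:]
    if not old:
        return messages  # nothing old enough to fold into a recap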
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = call_llm(
        "Summarize this conversation so it can replace the transcript. "
        "Keep names, decisions, hard constraints, and open questions. "
        "Be terse.\n\n" + transcript
    )
    recap = {"role": "user",
             "content": "[Summary of earlier conversation]\n" + summary}
    return [system, recap] + recent

Wire either function in right before the API call in your chat loop, and keep the full history in your database regardless — the policy controls what the model sees per request, not what you store. That separation matters later: you can change strategies, re-summarize with a better prompt, or debug "why did it forget X" only if the raw transcript still exists.

Common pitfalls

  • Trimming away the system prompt. A naive "drop oldest message" loop eventually deletes message zero, and your bot's persona, rules, and safety instructions vanish mid-conversation. Always pin it.
  • Splitting a tool-call pair. If your assistant uses tools, a tool call and its result are a matched set. Trim one without the other and most chat APIs reject the request outright. Treat call+result as one atomic unit when cutting.
  • Summarizing away hard constraints. "User is allergic to penicillin" or "budget is $500, firm" must survive every compaction. Tell the summarizer explicitly to preserve constraints — and consider keeping a separate, never-summarized list of pinned facts.
  • Killing your prompt cache. Many providers cache the shared prefix of repeated requests and discount it heavily. Trimming one message off the front each turn changes everything after the system prompt every time, so the conversation itself never hits the cache. Counterintuitive result: aggressive trimming can raise your bill. Fix: trim in large chunks, rarely, so the prefix stays stable between compactions.
  • Counting tokens by vibes. A characters-divided-by-four estimate is fine far from the limit but will eventually overflow on code, URLs, or non-English text. Near the budget edge, use your provider's real tokenizer or token-counting endpoint.
  • Letting injected content ride forever. If documents pasted into the conversation stay in history, one 30K-token PDF dominates the window for the rest of the chat. Drop or summarize bulky attachments independently of the chat turns.
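One way to avoid splitting a tool-call pair is to trim whole turns instead of individual messages. The sketch below assumes a simplified, hypothetical message shape in which tool results use the role "tool" and always follow the assistant message that requested them; real provider formats differ. The count_tokens argument is a caller-supplied function, not a library call.

```python
def group_into_turns(rest: list) -> list:
    """Split non-system messages into turns that each begin at a user message.

    Everything between two user messages (assistant replies, tool calls,
    tool results) stays in the same group, so a call and its result
    can never be separated.
    """
    turns = []
    for msg in rest:
        if msg["role"] == "user" or not turns:
            turns.append([])
        turns[-1].append(msg)
    return turns


def trim_whole_turns(messages: list, budget: int, count_tokens) -> list:
    """Drop the oldest whole turns until the history fits the budget.

    messages[0] is assumed to be the system prompt and is never dropped.
    The most recent turn is always kept, even if it alone exceeds the budget.
    """
    system, rest = messages[0], messages[1:]
    turns = group_into_turns(rest)
    total = count_tokens(system) + sum(count_tokens(m) for m in rest)
    while len(turns) > 1 and total > budget:
        dropped = turns.pop(0)
        total -= sum(count_tokens(m) for m in dropped)
    return [system] + [m for turn in turns for m in turn]
```

Grouping by turn is coarser than strictly necessary, but it keeps the invariant easy to reason about: a cut only ever lands immediately before a user message.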

Going deeper

Check what your provider does server-side before building. This problem is common enough that it's moving into the API layer. Anthropic, for example, offers server-side compaction — the API summarizes earlier conversation automatically when you approach the limit — plus context editing options like clearing old tool results. Reasoning models add a wrinkle here too: their thinking traces are typically stripped from prior turns automatically, so they don't compound your history growth. A strategy you hand-roll today may be a flag you flip tomorrow; the concepts in this article are what those flags implement.

Hierarchical memory treats context like RAM. The MemGPT paper framed the context window as an operating system's main memory, with the full conversation history paged out to external storage. The model itself decides — via tool calls — what to page in, what to write to its long-term store, and what to evict. This turns history management from a fixed policy in your code into a skill the model exercises, and it's the lineage behind agent memory systems like Letta.

Retrieval-based memory is the other escape hatch. Instead of compressing old turns, embed them and store them in a vector database; on each new message, retrieve the handful of past exchanges most relevant to it and inject only those. Memory becomes RAG over your own transcript. It scales to months of conversation history and recalls verbatim detail that summaries lose — at the price of a retrieval pipeline, and a new failure mode: the relevant memory exists but doesn't get retrieved.
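The retrieval idea can be sketched without any external service. In the toy version below, word counts and cosine similarity stand in for a real embedding model and vector database, purely for illustration. The class and function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TranscriptMemory:
    """In-memory store of past exchanges, searched by similarity."""

    def __init__(self) -> None:
        self._items = []  # list of (vector, original text)

    def add(self, text: str) -> None:
        self._items.append((embed(text), text))

    def search(self, query: str, k: int = 2) -> list:
        """Return up to k stored texts most similar to the query."""
        q = embed(query)
        scored = [(cosine(q, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:k] if score > 0]


if __name__ == "__main__":
    memory = TranscriptMemory()
    memory.add("user: we're based in Germany and ship from Berlin")
    memory.add("user: my favourite colour is blue")
    print(memory.search("Which country are we based in?", k=1))
```

Word overlap misses paraphrases that a real embedding model would catch, which is a small-scale version of the failure mode described above: the memory exists but is not retrieved.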

Summarization quality is an eval problem, not a prompt problem. A compaction prompt that works in testing will eventually drop something that mattered. Production systems test for this directly: build a set of long conversations with known critical facts, run compaction, then ask questions that depend on those facts. Track the recall rate like any regression metric, and re-run it whenever you touch the summarizer prompt or change models. Structured summaries help — forcing the recap into fixed fields like facts / decisions / open questions makes omissions visible and diffable.
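A compaction regression check can start very small. The sketch below uses a crude proxy: it looks for each critical fact as a substring of the summary instead of asking the model follow-up questions. The case format and the recall_rate name are assumptions made for illustration.

```python
def recall_rate(cases: list, summarize) -> float:
    """Fraction of critical facts that survive compaction.

    Each case is assumed to look like
        {"transcript": str, "facts": [str, ...]}
    and `summarize` is any callable mapping a transcript to a summary.
    A fact counts as recalled if it appears verbatim (case-insensitive)
    in the summary; a stricter eval would ask questions instead.
    """
    kept = total = 0
    for case in cases:
        summary = summarize(case["transcript"]).lower()
        for fact in case["facts"]:
            total += 1
            if fact.lower() in summary:
                kept += 1
    return kept / total if total else 1.0


if __name__ == "__main__":
    cases = [{
        "transcript": ("user: my budget is $500, firm.\n"
                       "assistant: noted.\n"
                       "user: also, I am allergic to penicillin."),
        "facts": ["$500", "penicillin"],
    }]
    # A deliberately bad summarizer: keep only the first 40 characters.
    print(recall_rate(cases, lambda text: text[:40]))
```

The truncating summarizer keeps the budget but loses the allergy, so the score is 0.5. Tracking that number across prompt or model changes is what turns summarization quality into a regression metric.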

The open problem is salience. Every strategy above must decide what's important before knowing what the user will ask next. A throwaway remark in turn 3 ("oh, we're based in Germany") can become load-bearing in turn 90 when the conversation turns to shipping or privacy law. Humans handle this with cheap, vast, associative memory; LLM systems approximate it with summaries, retrieval, and pinned-fact lists — each a different bet on what future-you will need. Getting that bet right is most of the craft.

FAQ

How do I keep an LLM conversation going past the context window limit?

You can't send more than the window holds, so you send less: trim the oldest messages, keep only a sliding window of recent turns, or summarize older turns into a short recap and send that plus the recent messages. Most production apps use the hybrid — running summary plus the last 5–10 turns verbatim. The full transcript stays in your database; only the model's per-request view shrinks.

Should I trim chat history or summarize it?

Trim (or use a sliding window) when old turns genuinely stop mattering — quick lookups, one-off Q&A, high-volume bots. Summarize when users expect the bot to remember things from earlier — support tickets, tutoring, companions, agents. Summarization costs an extra LLM call per compaction but preserves the gist; trimming is free but the dropped turns are simply gone.

Does trimming chat history break prompt caching?

It can, badly. Prompt caches match on the shared prefix of consecutive requests. If you shave one message off the front of the history every turn, the prefix changes every time and at best the system prompt still hits the cache, so a strategy meant to cut costs can increase them. The fix is to compact rarely and in big chunks: let the history grow untouched (cache-friendly), then cut a large slice at once when you near the budget.
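Chunked compaction can be sketched with two thresholds. The high and low water marks below are hypothetical numbers, and count_tokens is a caller-supplied function. One assumption matters: the caller carries the returned view forward as its working history. Recomputing from the full transcript every turn would move the cut point each time and defeat the purpose.

```python
def compact_in_chunks(view: list, count_tokens,
                      high_water: int = 50_000,
                      low_water: int = 30_000) -> list:
    """Leave the view untouched until it exceeds high_water, then cut to low_water.

    view[0] is assumed to be the system prompt. Between compactions the
    returned list is identical to the input, so the request prefix is stable.
    """
    system, rest = view[0], list(view[1:])
    total = count_tokens(system) + sum(count_tokens(m) for m in rest)
    if total <= high_water:
        return view  # prefix unchanged since the last request
    # One large cut: drop oldest messages until well under the budget.
    while len(rest) > 1 and total > low_water:
        total -= count_tokens(rest.pop(0))
    # Do not let the trimmed history open on a non-user message.
    while len(rest) > 1 and rest[0]["role"] != "user":
        total -= count_tokens(rest.pop(0))
    return [system] + rest
```

The gap between the two thresholds sets how many turns pass between compactions: a wider gap means fewer prefix changes but a larger one-time loss of context.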

How many recent messages should I keep verbatim when summarizing?

Five to ten turns is the common starting range. The recent tail must be verbatim because follow-ups depend on exact wording — "make it shorter" or "change that to blue" are meaningless against a summary. Keep enough turns that pronouns resolve, but not so many that the summary never kicks in; tune by replaying real conversations.

Why does my chatbot forget the user's name halfway through a conversation?

Almost always because the message where the name was mentioned got trimmed out of the window, or a summarization pass dropped it. Models don't gradually forget — a fact is either in the context or it isn't. Fixes: instruct your summarizer to always preserve names and constraints, or maintain a small pinned-facts block (name, preferences, hard constraints) that is injected every turn and never subject to trimming.
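A pinned-facts block can be as simple as the sketch below. Placing the facts inside the system message is one possible design, chosen here as an assumption. The function name and the wording of the facts header are illustrative.

```python
def with_pinned_facts(system_prompt: str, facts: dict, recent: list) -> list:
    """Build the per-request message list: system prompt, pinned facts, recent turns.

    `facts` is a small dict the application maintains itself (for example
    name, preferences, hard constraints). It is stored outside the chat
    history, so no trimming or summarizing pass can drop it.
    """
    content = system_prompt
    if facts:
        lines = [f"- {key}: {value}" for key, value in facts.items()]
        content += ("\n\nKnown facts about the user (always apply):\n"
                    + "\n".join(lines))
    return [{"role": "system", "content": content}] + list(recent)


if __name__ == "__main__":
    request = with_pinned_facts(
        "You are a helpful assistant.",
        {"name": "Priya", "allergy": "penicillin"},
        [{"role": "user", "content": "What can I take for a sore throat?"}],
    )
    print(request[0]["content"])
```

Because the facts live in application state and not in the transcript, they are re-injected on every request regardless of which history policy runs.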

Further reading