In plain English
A bare LLM has no memory at all. Every time you call it, the model wakes up fresh — no idea who you are, what you discussed yesterday, or that you spent an hour last week debugging the exact same problem together. That forgetfulness is fine for a one-shot Q&A, but it makes a mess of anything longer.
AI agents fix this by giving the model two kinds of memory — and the distinction maps almost perfectly onto human short-term and long-term memory.
Short-term memory is the context window: the text that is literally pasted in front of the model at the moment it generates its next reply. Everything in the current conversation — your messages, the agent's replies, tool outputs, and the system prompt — lives here. It is fast, accurate, and completely invisible to any future session. When the conversation ends, it is gone.
Long-term memory is information written to external storage and retrieved later. A vector database, a SQL table, a plain JSON file on disk — the format varies, but the idea is the same: the agent saves something important now, so it can pull it back later, in a brand-new session where the context window has been wiped clean.
Why it matters
The moment a user has a second conversation with your agent, short-term memory alone is not enough. The agent has forgotten everything: the user's name, their preferences, the task you left half-finished, the fact that the API key they use is stored in their vault. Without long-term memory, every session starts from zero — users repeat themselves, agents re-derive context, and what could have been a relationship becomes a cycle of amnesia.
What goes wrong when memory is missing
- No short-term management — the context window fills up with stale turns, token costs spike, and the model starts dropping earlier information (the "lost in the middle" problem where things buried in a long context get ignored).
- No long-term memory — the agent asks for the same context on every session. Users lose patience; continuity is impossible; personalisation never improves.
- Long-term memory with no retrieval strategy — the agent dumps everything it ever saved into the prompt on every turn. The context window bloats, irrelevant history crowds out useful information, and costs skyrocket.
- Long-term memory with no write policy — the agent never decides what is worth saving, so nothing accumulates, or it saves everything and the store becomes noise.
Getting the boundary right — what stays in-context, what gets written to long-term storage, and what gets retrieved and when — is one of the most important design decisions in building a production agent.
How agent memory works
At the mechanical level, the two memory types work completely differently. Short-term memory is the prompt — no retrieval step, no database call. Long-term memory requires three distinct operations: write, index, and retrieve.
Short-term memory: the context window
Every token passed to the model at inference time is short-term memory. This includes the system prompt, the full conversation history, any documents you pasted in, and every tool result the agent received. The model can read all of it directly, with no retrieval step, but it is strictly bounded. Modern frontier models typically support context windows of roughly 128,000 to 200,000 tokens, and some offer 1 million tokens, in certain cases as a beta feature. Large as those numbers are, they still have a ceiling, and they reset completely between sessions.
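Because the window is bounded, agents usually cap how much history they resend on each call. The following is a minimal sketch of one way to do that, assuming a crude estimate of about 4 characters per token (a real system would use the model's own tokenizer). The names `estimate_tokens` and `trim_history` are hypothetical and exist only for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: assumes ~4 characters per token (an approximation)."""
    return max(1, len(text) // 4)


def trim_history(system_prompt: str, turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns that fit in `budget` tokens after the system prompt."""
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[str] = []
    # Walk backwards so the newest turns are considered first.
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept


turns = ["user: hi", "agent: hello, how can I help?", "user: summarise my notes"]
recent = trim_history("You are a helpful assistant.", turns, budget=20)
```

Dropping the oldest turns first is the simplest policy. Anything dropped this way is lost unless it was written to long-term memory beforehand.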
Long-term memory: external storage + retrieval
Long-term memory lives outside the model entirely — in a database the agent reads and writes through tool calls. The most common pattern uses a vector database (Pinecone, Weaviate, pgvector, Chroma): the agent converts text into a numerical embedding, stores it, and later retrieves the most semantically similar entries for a given query. This is the same plumbing that powers RAG, applied to the agent's own accumulated knowledge rather than external documents.
At the start of each session, the agent searches long-term memory for entries relevant to the current task and injects the top results into the system prompt. From the model's perspective, those retrieved memories look like short-term memory: they are just text in the context window. The difference is that they were fetched from persistent storage rather than carried over in the context from the last session.
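The write, index, and retrieve cycle described above can be sketched in a few lines. This is a toy illustration, not a production design: `embed` uses plain word counts as a stand-in for a real embedding model, and `MemoryStore` is a hypothetical in-memory substitute for a vector database such as those named above.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MemoryStore:
    """In-memory stand-in for a vector database (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, Counter]] = []

    def write(self, text: str) -> None:
        # Write + index: store the text alongside its embedding.
        self.entries.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Retrieve: rank stored entries by similarity to the query.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = MemoryStore()
store.write("user prefers numbered lists over prose")
store.write("user timezone is Europe/Amsterdam")
store.write("project uses PostgreSQL with pgvector")

# New session: fetch the most relevant memory and inject it into the prompt.
memories = store.retrieve("which timezone is the user in", k=1)
system_prompt = "You are a helpful assistant.\nRelevant memories:\n" + "\n".join(memories)
```

Word-count vectors only match on shared words. A learned embedding model would also match paraphrases, which is the main reason real systems use one.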
Short-term memory (context window)
- Lifetime: current session only
- Storage: model's active context
- Retrieval: automatic, always visible
- Capacity: 128k–1M+ tokens (model-dependent)
- Cost: charged per token, every call
- Best for: active task state, recent turns, tool outputs
Long-term memory (external storage)
- Lifetime: persists across sessions indefinitely
- Storage: vector DB, SQL, key-value store, etc.
- Retrieval: explicit search call (similarity or keyword)
- Capacity: effectively unlimited
- Cost: storage + retrieval query, not per-token
- Best for: user facts, preferences, past conversations, domain knowledge
What agents actually save to long-term memory
Not everything in a session is worth persisting. The agent (or a background process) must make a judgment call. Here are the three most common categories of things worth writing to long-term memory.
User facts and preferences
The most immediately useful long-term memories are personal: the user's name, their time zone, their preferred output format ("I like numbered lists, not prose"), their area of expertise, and recurring context like which project they are working on. These facts change slowly and pay off on every subsequent session.
Session summaries (episodic memory)
At the end of a session, the agent (or a separate background process) compresses what happened into a short summary: what was asked, what was decided, what was left unfinished. These compressed summaries are far cheaper to store and retrieve than raw transcripts, and they give the agent enough context to resume naturally in the next session.
Domain knowledge and facts (semantic memory)
The agent can accumulate timeless knowledge over time: API schemas it looked up, a company's internal terminology, product specs, rules it discovered by trial and error. Unlike session summaries, these facts are not tied to a particular conversation — they are just true, and they can be retrieved any time a related query arrives.
# Minimal sketch: saving and retrieving a user preference
import json
from pathlib import Path

MEMORY_FILE = Path("user_memory.json")


def save_memory(key: str, value: str) -> None:
    """Write a fact to long-term memory."""
    store = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    store[key] = value
    MEMORY_FILE.write_text(json.dumps(store, indent=2))


def recall_memory(key: str) -> str | None:
    """Retrieve a fact from long-term memory."""
    if not MEMORY_FILE.exists():
        return None
    return json.loads(MEMORY_FILE.read_text()).get(key)


# Session 1: learn the user's preference
save_memory("output_format", "numbered lists, not prose")
save_memory("timezone", "Europe/Amsterdam")

# Session 2 (new context window): inject saved facts into the system prompt
format_pref = recall_memory("output_format")
tz = recall_memory("timezone")
system_prompt = f"""You are a helpful assistant.
User preferences: format={format_pref}, timezone={tz}"""
# The LLM now 'remembers' across sessions, even though the context window reset.
Common pitfalls
Agent memory is one of the areas where small design mistakes compound into large production headaches. Here are the mistakes teams hit most often.
Stuffing everything into context
The easiest approach — and the one that fails first in production — is to append the entire conversation history plus every past memory to every prompt. It works in demos. It collapses under real usage: token costs multiply, the model loses track of older information buried in the middle of a long context, and latency grows. A retrieval step that selects only the relevant memories is not optional at scale.
Saving too much (or nothing at all)
Agents that save every single utterance produce long-term stores full of noise — "the user said hi", "the user asked about the weather", "the user asked about the weather again". Retrieving from a noisy store degrades answer quality. Agents that save nothing gain nothing. A write policy that targets durable, high-value facts — preferences, decisions, unfinished tasks — keeps the store clean.
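A write policy can start as a simple filter in front of the store. The sketch below is rule-based and deliberately naive: the phrases in `DURABLE_PATTERNS` are hypothetical markers chosen for illustration, and many systems would instead ask an LLM to extract durable facts.

```python
import re

# Hypothetical markers of durable, high-value statements; tune these for your domain.
DURABLE_PATTERNS = [
    r"\bi prefer\b",
    r"\bmy name is\b",
    r"\bwe decided\b",
    r"\bstill need to\b",
]


def worth_saving(utterance: str) -> bool:
    """Return True if the utterance looks like a preference, decision, or open task."""
    text = utterance.lower()
    return any(re.search(pattern, text) for pattern in DURABLE_PATTERNS)


session = [
    "hi",
    "what's the weather like?",
    "I prefer numbered lists, not prose",
    "we decided to ship the v2 API on Friday",
]
# Only the durable statements are passed on to long-term storage.
to_persist = [u for u in session if worth_saving(u)]
```

The point of the filter is the default: an utterance is discarded unless it matches a reason to keep it, which keeps greetings and one-off questions out of the store.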
Stale memories
Long-term memories age. A fact that was true in January ("I prefer dark mode", "the rate limit is 60 RPM") may be outdated by June. Without expiry dates or a periodic review pass, the agent will confidently repeat stale information. Simple mitigation: store a written_at timestamp with every memory and prefer recent entries when two memories conflict.
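The timestamp mitigation can be sketched as follows. The `Memory` record and `resolve` helper are hypothetical, and the 180-day expiry is an arbitrary assumption; pick a window that matches how quickly your facts change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Memory:
    key: str
    value: str
    written_at: datetime  # stored with every memory, as suggested above


def resolve(memories: list[Memory], key: str, now: datetime,
            max_age: timedelta = timedelta(days=180)) -> Optional[str]:
    """Return the newest non-expired value for `key`, or None if all are stale."""
    fresh = [m for m in memories if m.key == key and now - m.written_at <= max_age]
    if not fresh:
        return None
    # When two memories conflict, the most recently written one wins.
    return max(fresh, key=lambda m: m.written_at).value


now = datetime(2025, 6, 15, tzinfo=timezone.utc)
memories = [
    Memory("theme", "dark mode", datetime(2025, 1, 10, tzinfo=timezone.utc)),
    Memory("theme", "light mode", datetime(2025, 6, 1, tzinfo=timezone.utc)),
    Memory("rate_limit", "60 RPM", datetime(2024, 1, 5, tzinfo=timezone.utc)),
]
theme = resolve(memories, "theme", now)            # newer entry wins
rate_limit = resolve(memories, "rate_limit", now)  # too old, treated as unknown
```

Returning `None` for an expired fact lets the agent re-check or ask the user instead of repeating stale information with confidence.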
Treating memory as a magic fix for context-window limits
Long-term memory is not a lossless extension of the context window. Retrieval is probabilistic — the most semantically similar entry is not always the most useful one. Critical information that the agent absolutely must have (the system rules, the current task description) belongs in the context window directly, not in an external store where a bad retrieval query might miss it.
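One way to apply this is to assemble the prompt from pinned sections plus best-effort retrieved sections. The sketch below assumes a hypothetical `build_prompt` helper and section labels; the idea is only that rules and the task never depend on a retrieval result.

```python
def build_prompt(system_rules: str, task: str, retrieved: list[str],
                 max_memories: int = 3) -> str:
    """Assemble a prompt: pinned sections always appear; memories are best-effort."""
    sections = [
        "Rules (always included):\n" + system_rules,
        "Current task (always included):\n" + task,
    ]
    if retrieved:
        # Retrieved memories are optional and capped so they cannot crowd out the rest.
        bullets = "\n".join(f"- {m}" for m in retrieved[:max_memories])
        sections.append("Possibly relevant memories:\n" + bullets)
    return "\n\n".join(sections)


prompt = build_prompt(
    "Never reveal API keys.",
    "Draft the release notes.",
    ["user prefers numbered lists"],
)
```

If retrieval returns nothing, or returns the wrong entries, the prompt still carries the rules and the task description.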
Going deeper
Once you have the short-term/long-term split working, a second tier of questions opens up: how do you decide when to write, how do you handle conflicts, and how do the two tiers interact in real frameworks?
The four memory sub-types
Researchers and practitioners further divide long-term memory into three sub-types — episodic (past events), semantic (timeless facts), and procedural (skills and workflows) — plus in-context working memory. Each sub-type has different write triggers, retrieval patterns, and staleness characteristics. If the short-term/long-term split is the foundation, understanding these four types is the next step.
Hot-path vs background writes
Memory can be written in the hot path (the agent explicitly calls a save_memory tool before replying — what ChatGPT's memory feature does) or in the background (a separate process reads the session transcript and extracts memories after the conversation ends, with no latency cost to the user). Hot-path writes ensure the agent has the memory immediately; background writes keep response latency low. Most production systems combine both: user preference corrections are saved immediately, while session summaries are extracted in the background.
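The two write paths can be sketched side by side. Everything here is a simplified assumption: `memory_store` and `session_summaries` stand in for the long-term store, `summarise` is a placeholder where a real system would call an LLM, and the background worker is run inline rather than in a separate process.

```python
import queue

memory_store: dict[str, str] = {}        # stand-in for long-term storage of facts
session_summaries: list[str] = []        # stand-in for stored session summaries
transcript_queue: queue.Queue = queue.Queue()  # work waiting for the background job


def save_memory(key: str, value: str) -> None:
    """Hot path: called before the agent replies, so the fact is usable at once."""
    memory_store[key] = value


def end_session(transcript: list[str]) -> None:
    """Hand the transcript to the background worker; the user never waits on this."""
    transcript_queue.put(transcript)


def summarise(transcript: list[str]) -> str:
    """Placeholder summariser: a real system would call an LLM here."""
    return f"{len(transcript)} turns; last: {transcript[-1]}"


def run_background_worker() -> None:
    """Drain queued transcripts and store one summary per session."""
    while not transcript_queue.empty():
        session_summaries.append(summarise(transcript_queue.get()))


# A preference correction is saved immediately; the summary is extracted later.
save_memory("output_format", "numbered lists")
end_session(["user: use numbered lists", "agent: will do"])
run_background_worker()
```

The split shows the trade-off from the text: `save_memory` adds work before the reply but makes the fact available in the same session, while the queued transcript costs the user nothing and is processed whenever the worker runs.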
Memory frameworks
Several open-source libraries handle the write/index/retrieve cycle for you. Mem0 automatically extracts and stores all four memory sub-types from agent conversations — it decides what to save and where, and exposes a simple add / search API. LangMem (part of LangChain) integrates with LangGraph agents and supports both hot-path and background writes. Letta (formerly MemGPT) uses a three-tier model inspired by OS memory management: core memory always in-context (like RAM), archival memory in an external vector store (like disk), and recall memory from conversation history — agents explicitly call memory management functions to move information between tiers.
Memory in multi-agent systems
When multiple agents collaborate on a task, a new question arises: which memories are private to one agent and which are shared across the team? A common pattern gives each worker agent its own episodic memory (its own conversation history) while sharing a single semantic memory store (common facts, the project knowledge base) across the whole team. The orchestrator agent typically manages writes to the shared store to prevent conflicts. This mirrors how a human team shares a wiki but keeps personal notes private.
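The private-versus-shared split can be sketched with two small classes. This is an illustration under assumptions, not a reference design: `SharedMemory`, `Worker`, and the rule that only a named orchestrator may write are hypothetical ways to express the pattern.

```python
from typing import Optional


class SharedMemory:
    """Team-wide semantic store; only the designated writer may modify it."""

    def __init__(self, writer: str) -> None:
        self._writer = writer
        self._facts: dict[str, str] = {}

    def write(self, agent_name: str, key: str, value: str) -> None:
        # Routing all writes through one agent avoids conflicting updates.
        if agent_name != self._writer:
            raise PermissionError(f"{agent_name} may not write to shared memory")
        self._facts[key] = value

    def read(self, key: str) -> Optional[str]:
        return self._facts.get(key)


class Worker:
    """Worker agent with private episodic memory and read access to shared facts."""

    def __init__(self, name: str, shared: SharedMemory) -> None:
        self.name = name
        self.shared = shared
        self.episodic: list[str] = []  # private: this agent's own history

    def observe(self, event: str) -> None:
        self.episodic.append(event)


shared = SharedMemory(writer="orchestrator")
shared.write("orchestrator", "api_base", "https://api.example.com")

researcher = Worker("researcher", shared)
coder = Worker("coder", shared)
researcher.observe("looked up the API schema")
```

Both workers can read `api_base`, but only `researcher` holds the record of looking up the schema, which matches the wiki-plus-personal-notes analogy.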
KV cache: the implicit short-term memory layer
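Modern LLM inference servers use a KV cache to avoid recomputing attention over repeated prompt prefixes. If the start of your prompt is identical across calls (say, a large product manual or shared instructions in the system prompt), the server can reuse the cached computation for that prefix instead of processing it again. Where a provider exposes this as prompt caching, cached prefix tokens are typically billed at a reduced rate rather than for free, and cache entries expire after a period of inactivity, so the saving applies only to calls that arrive while the entry is still warm. This is not "memory" in the agent sense, but it is a meaningful latency and cost lever when combined with a large static knowledge base injected into the system prompt.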
FAQ
What is short-term memory in an AI agent?
Short-term memory is the agent's context window — all the text currently passed to the model: the system prompt, conversation history, tool results, and any documents pasted in. It is always visible to the model without a retrieval step, but it resets to empty at the start of every new session. Anything you want to survive beyond the current conversation must be written to long-term storage.
What is long-term memory in an AI agent?
Long-term memory is information stored in an external database (most commonly a vector database) that the agent can write to and read from across multiple sessions. At the start of a session the agent retrieves relevant entries and injects them into the context window. Unlike short-term memory it is not cleared between sessions, so it enables personalisation, continuity, and accumulated knowledge.
How do AI agents remember things between conversations?
By saving important information to an external store at the end of a session — user preferences, task summaries, key facts — and then retrieving the most relevant entries at the start of the next session. Those retrieved memories are injected into the system prompt, so the model reads them as part of its context window. From the model's perspective, it looks like the information was always there.
Is the context window really "memory"?
It functions like memory in the sense that the model can reference any information it contains. But it is more like a whiteboard than memory: perfect recall of everything currently on it, but wiped completely when the session ends. True cross-session memory requires writing to an external store. The context window is where retrieved memories land, not where they live.
Do larger context windows make long-term memory unnecessary?
Not quite. Even a 1-million-token context window has three limitations: it resets between sessions (so it cannot replace persistent storage), it costs money per token every single call (so keeping large amounts of stale history is expensive), and research shows models can struggle with information buried in the middle of very long contexts. Long-term memory with selective retrieval is still the right architecture for persistent, personalised agents.
What is a vector database and why do agents use it for memory?
A vector database stores text as numerical embeddings and retrieves entries by semantic similarity — finding the most relevant memories even if the exact words don't match. For example, a query like "what does the user prefer?" can retrieve a stored fact about output formatting even if it was saved under a different phrasing. Popular options include Pinecone, Weaviate, Chroma, and pgvector (a PostgreSQL extension).