AI/TLDR

How to Build Multi-Turn Conversations with a Stateless API

Learn why the API forgets everything between calls and how real chat apps fake memory by resending and trimming history.

BEGINNER | 12 MIN READ | UPDATED 2026-06-12

In plain English

Every time you call an LLM API, the model starts completely blank. There is no session, no running process keeping your conversation alive between calls — each request is processed from scratch, exactly like the first one you ever sent. This is what engineers mean when they say the API is stateless: there is no state being held for you on the server between requests.

The 'Internet of Things' communication network in the ocean — Patrizio Mariani, Ralf Bachmayer, Sokol Kosta, Ermanno Pietr

The analogy that makes this click: imagine a very talented consultant who reads incredibly fast, but suffers complete amnesia the moment you hang up the phone. Every time you call back, you have to recap the whole conversation from the beginning before asking your next question. That is precisely what your app has to do with an LLM API. The 'memory' you experience in polished chat products is not magic — it is the application recapping the entire conversation transcript at the start of every single API call.

Why it matters for builders

If you skip history management, every message your user sends gets a response that ignores everything they said before. The model has no idea it already introduced itself, no idea the user said they prefer Python examples, no idea what question it just answered. The result is a frustrating, incoherent chat experience.

Get it right, and you have a fully functional conversational app with no server-side state to maintain. Get it wrong in the other direction — and send too much history — and you hit two concrete problems: cost and context limits. Because every turn re-sends all previous turns as input tokens, a ten-turn conversation can cost dramatically more than ten times the first message. And every model has a hard limit on total tokens it will accept in a single request; exceed it and the API returns an error.

  • No history at all: model is incoherent, forgets everything between messages
  • Too much history: hits the model's context window limit, request fails with an error
  • Unbounded history growth: token costs compound with every turn, expensive at scale
  • Right-sized history: coherent conversation, predictable costs, requests always succeed

How it works: the messages array

Both major APIs — OpenAI's Chat Completions and Anthropic's Messages — represent a conversation as an array of message objects, where each object has a role and content. The roles are user (what the human said) and assistant (what the model replied). Some APIs also support a system role for a persistent instruction that frames the whole conversation.

To build a multi-turn conversation, your application maintains this array in memory (or a database), appends each new turn, and sends the entire array with every API call. The model reads the full transcript top-to-bottom before generating its reply at the bottom — like reading a chat log and then responding.
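To see why the full array has to travel with every call, here is a minimal offline sketch. `fake_model` is a hypothetical stand-in for a real API call, not part of any SDK. Like a stateless endpoint, it can only use what is in the `messages` argument it receives.

```python
def fake_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for a stateless chat endpoint.

    It "knows" the user's name only if a user message in the request
    it was handed contains the phrase "my name is <name>".
    """
    for msg in messages:
        if msg["role"] != "user":
            continue
        text = msg["content"].lower()
        if "my name is " in text:
            name = text.split("my name is ", 1)[1].split()[0].strip(".,!?")
            return f"Your name is {name.capitalize()}."
    return "I don't know your name."


history = [
    {"role": "user", "content": "Hi, my name is Ada."},
    {"role": "assistant", "content": "Nice to meet you, Ada!"},
]
question = {"role": "user", "content": "What is my name?"}

# Sending only the new message: the earlier turn is invisible to the model.
without_history = fake_model([question])

# Sending the full transcript plus the new message: the fact is available.
with_history = fake_model(history + [question])
```

The only difference between the two calls is what the caller chose to include in the request, which is the whole point of the pattern.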

Here is what this looks like in practice with the OpenAI SDK in Python. The messages list starts with a system instruction, then grows with each turn:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

# Seed with a system message (optional but common)
messages = [
    {"role": "system", "content": "You are a helpful assistant."}
]

while True:
    user_input = input("You: ")
    if user_input.lower() in ("quit", "exit"):
        break

    # 1. Append the new user turn
    messages.append({"role": "user", "content": user_input})

    # 2. Send the FULL messages array
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )

    assistant_reply = response.choices[0].message.content
    print(f"Assistant: {assistant_reply}")

    # 3. Append the assistant reply so next turn has context
    messages.append({"role": "assistant", "content": assistant_reply})
```

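The Anthropic Messages API follows the same pattern: alternating user/assistant turns, with the full array resent on every call. The one structural difference is that the system prompt is passed as a top-level system parameter rather than as a message in the array: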

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

messages = []  # Anthropic uses a top-level system param instead

while True:
    user_input = input("You: ")
    if user_input.lower() in ("quit", "exit"):
        break

    messages.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=messages  # full history every time
    )

    assistant_reply = response.content[0].text
    print(f"Assistant: {assistant_reply}")

    messages.append({"role": "assistant", "content": assistant_reply})
```
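If you keep one history format and want to send it to either provider, the system prompt has to be pulled out of the array for Anthropic. A minimal sketch, assuming the history is stored OpenAI-style with string content; `split_system` is a hypothetical helper name, not an SDK function:

```python
def split_system(messages: list[dict]) -> tuple[str, list[dict]]:
    """Separate system-role entries from the user/assistant turns.

    Returns (system_text, turns). Multiple system messages are joined
    with blank lines; turns keep their original order.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), turns


openai_style = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]

system_text, turns = split_system(openai_style)
```

The two return values map onto the `system=` and `messages=` arguments used in the Anthropic loop above.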

The token cost reality

Tokens are the currency you pay with — roughly three to four characters each. Every API request bills you for all input tokens (the entire messages array you send) plus the output tokens the model generates. Because you re-send the full history each turn, token costs grow with every message exchanged.

| Turn | Input tokens sent | What is included |
|------|-------------------|------------------|
| 1 | ~50 | System prompt + first user message |
| 3 | ~300 | System + turns 1-2 + new message |
| 10 | ~2 000 | System + turns 1-9 + new message |
| 20 | ~6 000 | System + turns 1-19 + new message |
| 50 | ~20 000 | All prior turns accumulate |

Turn 50 in the table above sends roughly 400 times the tokens of turn 1. With GPT-4o at roughly $2.50 per million input tokens, the turn-50 request alone costs about $0.05 in input tokens. Because every earlier turn was also billed for its own growing history, the conversation as a whole adds up to several hundred thousand input tokens, on the order of $1. That sounds small until you multiply it by thousands of daily users. This is why production apps almost always trim history rather than sending it all indefinitely.
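The arithmetic behind that growth can be sketched in a few lines. The per-message token counts below are illustrative assumptions (chosen so that turn 1 and turn 50 land near the table's figures), not measurements; only the $2.50 price comes from the text above.

```python
PRICE_PER_MILLION_INPUT = 2.50  # USD; the GPT-4o input price quoted above

# Illustrative assumptions, not measurements:
SYSTEM_TOKENS = 20   # system prompt
USER_TOKENS = 30     # one user message
REPLY_TOKENS = 370   # one assistant reply


def input_tokens_at_turn(turn: int) -> int:
    """Input tokens sent on the given 1-based turn when nothing is trimmed."""
    prior_exchanges = turn - 1
    return SYSTEM_TOKENS + prior_exchanges * (USER_TOKENS + REPLY_TOKENS) + USER_TOKENS


def cumulative_input_tokens(turns: int) -> int:
    """Total input tokens billed across turns 1..turns."""
    return sum(input_tokens_at_turn(t) for t in range(1, turns + 1))


def cost_usd(tokens: int) -> float:
    """Input cost in USD at the assumed price."""
    return tokens * PRICE_PER_MILLION_INPUT / 1_000_000


turn_50 = input_tokens_at_turn(50)        # tokens in the final request
whole_chat = cumulative_input_tokens(50)  # tokens billed over all 50 requests
```

Under these assumptions the final request is 19,650 input tokens (about $0.05), while the 50 requests together total 492,500 input tokens (about $1.23). Per-request input grows linearly with the turn number, so the running total grows quadratically.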

Strategies for trimming conversation history

Once you accept that you cannot send unlimited history forever, there are three practical strategies, each with different tradeoffs. Most production apps start with sliding window and graduate to summarization as the product matures.

Strategy 1: Sliding window (keep last N messages)

The simplest approach: keep only the most recent N messages (N is typically 10-20). Before each API call, slice the array to drop older turns. Fast, cheap, zero extra API calls.

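```python
MAX_HISTORY = 20  # keep last 20 non-system messages (10 turns)

def get_trimmed_messages(messages: list) -> list:
    """Return the system message(s) plus the most recent MAX_HISTORY turns."""
    # A plain messages[-MAX_HISTORY:] would eventually slice off the
    # system prompt at index 0, so pin system messages separately.
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    return system + turns[-MAX_HISTORY:]

# Usage in the chat loop (client and messages come from the loop above):
response = client.chat.completions.create(
    model="gpt-4o",
    messages=get_trimmed_messages(messages)  # trim before sending
)
```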
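One edge case with a fixed-size slice: it can cut an exchange in half, leaving an assistant reply as the first message. Anthropic's Messages API expects the array to open with a user turn, so a window for that API should start on one. A minimal sketch, assuming an Anthropic-style history that contains only user and assistant messages; `window_starting_on_user` is a hypothetical helper name:

```python
def window_starting_on_user(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the last `max_messages` messages, then drop any leading
    non-user messages so the window opens with a user turn."""
    recent = messages[-max_messages:]
    start = 0
    while start < len(recent) and recent[start]["role"] != "user":
        start += 1
    return recent[start:]


history = [
    {"role": "user", "content": "u1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"},
]

# A raw slice of 4 would start on "a1"; the helper skips ahead to "u2".
window = window_starting_on_user(history, max_messages=4)
```

The window may end up one message shorter than the cap, which is usually an acceptable price for a well-formed request.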

Strategy 2: Token-budget window

Rather than capping by message count, cap by token count. This is more precise because message lengths vary wildly — one user turn might be 5 tokens, another might be 500. Walk backwards through the history, counting tokens, and stop when you reach your budget.

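```python
# Requires: pip install tiktoken  (for OpenAI models)
import tiktoken

MAX_INPUT_TOKENS = 60_000  # leave headroom for the model's output

def trim_by_tokens(messages: list, model: str = "gpt-4o") -> list:
    """Keep system messages plus the newest turns that fit the budget."""
    enc = tiktoken.encoding_for_model(model)

    def count(msg: dict) -> int:
        # Assumes string content; ~4 tokens of per-message overhead
        return len(enc.encode(msg["content"])) + 4

    # Always keep system messages and charge them against the budget
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    total = sum(count(m) for m in system)

    kept = []
    for msg in reversed(turns):  # walk newest-first
        t = count(msg)
        if total + t > MAX_INPUT_TOKENS:
            break
        kept.append(msg)
        total += t

    kept.reverse()  # restore chronological order
    return system + kept
```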
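tiktoken only covers OpenAI tokenizers. When no tokenizer is at hand, the rule of thumb mentioned earlier (a token is roughly three to four characters of English text) gives a serviceable estimate. The sketch below assumes four characters per token and string content; it is an approximation, so leave extra headroom in the budget.

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text; an assumption


def estimate_tokens(text: str) -> int:
    """Approximate token count: ceiling of len(text) / 4, at least 1."""
    return max(1, -(-len(text) // CHARS_PER_TOKEN))


def trim_by_estimated_tokens(messages: list[dict], budget: int) -> list[dict]:
    """Keep the newest messages whose estimated tokens fit in `budget`."""
    kept = []
    total = 0
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg["content"])
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 40},       # ~10 tokens
    {"role": "assistant", "content": "b" * 40},  # ~10 tokens
    {"role": "user", "content": "c" * 40},       # ~10 tokens
]
trimmed = trim_by_estimated_tokens(history, budget=25)
```

With a budget of 25 estimated tokens, only the two newest messages fit, so the oldest one is dropped.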

Strategy 3: Summarize old turns, keep recent ones verbatim

When context continuity matters more than cost, keep the last 10 or so messages in full, but replace older ones with a compact AI-generated summary. You call the LLM to summarize the old chunk, store the summary as a single assistant or system message, and discard the raw turns. Research from mem0 suggests this hybrid approach can cut token costs 80-90% while preserving most conversational coherence.

```python
def summarize_old_turns(
    messages: list,
    keep_recent: int = 10,
    model: str = "gpt-4o-mini"
) -> list:
    """Summarize turns older than `keep_recent` into one block.

    Uses the `client` created in the chat loop above. System messages
    are kept as-is and never summarized.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    if len(turns) <= keep_recent:
        return messages

    old_turns = turns[:-keep_recent]
    recent_turns = turns[-keep_recent:]

    # Render the old turns as a readable transcript rather than a Python repr
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_turns)

    # Ask a cheap model to summarize the old chunk
    summary_resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize this conversation concisely, preserving key facts, decisions, and user preferences."},
            {"role": "user", "content": transcript}
        ]
    )
    summary_text = summary_resp.choices[0].message.content

    summary_message = {
        "role": "assistant",
        "content": f"[Earlier conversation summary: {summary_text}]"
    }

    return system + [summary_message] + recent_turns
```

| Strategy | Cost | Context quality | Complexity | Best for |
|----------|------|-----------------|------------|----------|
| Sliding window (last N) | Low | Good for recent topics, poor for early context | Trivial | Most chat apps |
| Token-budget window | Low | Same as sliding window but more accurate | Low | Variable-length messages |
| Summarize + keep recent | Medium (extra call) | Good overall, detail lost in old turns | Medium | Long sessions, support bots |
| Vector retrieval (RAG) | Medium-High | Excellent: retrieves relevant older context | High | Knowledge-heavy agents |

Going deeper

Once your basic multi-turn loop works, several advanced patterns become relevant at production scale.

Persisting conversation history

An in-memory messages list disappears when the process restarts. Production apps store the history in a database (Postgres, Redis, DynamoDB — any will do) keyed by a conversation_id. Fetch it at the start of each HTTP request, append the new turn, save it back. This decouples your web servers from conversation state entirely.
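A minimal sketch of that fetch/append/save cycle, using Python's built-in sqlite3 as a stand-in for whichever database you actually run. The table layout and the function names (`open_store`, `load_history`, `save_history`) are assumptions made for illustration.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; ':memory:' keeps this demo self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS conversations ("
        "conversation_id TEXT PRIMARY KEY, messages TEXT NOT NULL)"
    )
    return conn


def load_history(conn: sqlite3.Connection, conversation_id: str) -> list[dict]:
    """Return the stored messages array, or an empty list for a new conversation."""
    row = conn.execute(
        "SELECT messages FROM conversations WHERE conversation_id = ?",
        (conversation_id,),
    ).fetchone()
    return json.loads(row[0]) if row else []


def save_history(conn: sqlite3.Connection, conversation_id: str, messages: list[dict]) -> None:
    """Overwrite the stored messages array for this conversation."""
    conn.execute(
        "INSERT OR REPLACE INTO conversations (conversation_id, messages) VALUES (?, ?)",
        (conversation_id, json.dumps(messages)),
    )
    conn.commit()


# One request cycle: fetch, append the new turn, save back.
conn = open_store()
messages = load_history(conn, "conv-123")
messages.append({"role": "user", "content": "Hello"})
save_history(conn, "conv-123", messages)
```

Storing the whole array as one JSON blob keeps the sketch short; a table with one row per message is a common alternative when you need to query or trim individual turns.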

Stateful API wrappers

OpenAI's newer Responses API (introduced in 2025) supports a previous_response_id parameter that lets the API server chain responses without you re-sending the full history. Anthropic and other providers are adding similar server-side state features. These reduce bandwidth and simplify client code, but you trade visibility and portability — you can no longer inspect or edit history without additional API calls, and you are locked into that provider's session storage.

Prompt caching

Both Anthropic and OpenAI offer prompt caching: if the prefix of your messages array matches a recent cached prefix on their servers, input tokens in the cached portion are billed at a significantly lower rate (Anthropic charges 10% of normal price for cache hits). Long system prompts and stable conversation history prefixes are ideal candidates. The implication for history management: try to keep the start of your messages array stable across turns, and only append to the end.
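The interaction between trimming and caching is easy to check at the message level. The sketch below compares two consecutive requests and counts how many leading messages they share. It is only an illustration: real caches match on token prefixes and have provider-specific rules, and `common_prefix_len` is a hypothetical helper.

```python
def common_prefix_len(previous: list[dict], current: list[dict]) -> int:
    """Number of leading messages that are identical in both requests."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


system = {"role": "system", "content": "You are a helpful assistant."}
u1 = {"role": "user", "content": "Hi"}
a1 = {"role": "assistant", "content": "Hello!"}
u2 = {"role": "user", "content": "Explain tokens."}
a2 = {"role": "assistant", "content": "Tokens are chunks of text."}
u3 = {"role": "user", "content": "And context windows?"}

previous_request = [system, u1, a1, u2]

# Append-only: the whole previous request is still the prefix.
append_only = previous_request + [a2, u3]

# Sliding window that dropped the oldest exchange: the prefix breaks
# right after the system message.
slid = [system, u2, a2, u3]

append_only_shared = common_prefix_len(previous_request, append_only)
slid_shared = common_prefix_len(previous_request, slid)
```

Here the append-only request shares all four earlier messages, while the slid window shares only the system message. Trimming on every turn therefore tends to defeat caching; trimming in larger, less frequent steps keeps the prefix stable for longer.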

Vector-based long-term memory

For agents that need memory across many sessions — not just within one conversation — embedding past turns and retrieving semantically relevant ones at query time is the standard approach. Libraries like mem0 and LangChain's memory modules implement this. Instead of sending all history, you send only the most relevant past context. This scales to arbitrarily long histories but adds retrieval latency and requires a vector store.
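The retrieval idea can be shown without a vector store. The toy sketch below swaps real embeddings for bag-of-words vectors and cosine similarity, which is far cruder than what mem0 or LangChain do, but the shape is the same: score every past message against the new query and send only the top few. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    """Bag-of-words counts; a toy substitute for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, past_messages: list[dict], k: int = 2) -> list[dict]:
    """Return the k past messages most similar to the query."""
    q = _vector(query)
    ranked = sorted(
        past_messages,
        key=lambda m: _cosine(q, _vector(m["content"])),
        reverse=True,
    )
    return ranked[:k]


past = [
    {"role": "user", "content": "I prefer Python examples over JavaScript."},
    {"role": "user", "content": "My dog is called Rex."},
    {"role": "assistant", "content": "The weather API returns JSON."},
]

best = retrieve("Which language does the user prefer for examples?", past, k=1)
```

Word overlap picks out the message about Python preferences here; a real embedding model would also match paraphrases that share no words with the query.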

Message roles and system prompts

The system role (or Anthropic's top-level system parameter) is not just a nice-to-have. It is the most token-efficient place for instructions that must persist through the whole conversation: they appear once per request instead of being repeated inside every user message. The system prompt is still resent and billed as input tokens on every call, so keep it concise. Keep per-conversation facts (user name, preferences, current task) in the system prompt rather than injecting them as user messages; this keeps the turn history clean and easier to trim.

FAQ

Why does the LLM API not remember my previous messages automatically?

LLM APIs are stateless by design — each HTTP request is processed independently with no server-side session kept between calls. This makes the infrastructure simple to scale, but it means your application is responsible for maintaining and resending the conversation history. Every major provider (OpenAI, Anthropic, Google) follows this model for the standard API, though some now offer optional stateful wrappers on top.

What happens if I send too many messages and exceed the context window?

The API returns an error — typically a 400 Bad Request or a specific context-length error code. Your application must handle this. The fix is to trim older messages from the history before the next call, using a sliding window, token-budget cut, or summarization strategy. It is good practice to check token counts proactively so you never hit the limit mid-conversation.
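As a safety net behind proactive counting, the call can be wrapped so that a context-length failure triggers a trim and a retry. The sketch below is provider-agnostic: `send` is any function you supply that performs the API call, and `ContextTooLongError` is a hypothetical exception that your own `send` would raise after inspecting the provider's 400 response.

```python
class ContextTooLongError(Exception):
    """Hypothetical stand-in for a provider's context-length error."""


def send_with_trim(send, messages: list[dict], max_retries: int = 5):
    """Call send(messages); on a context-length error, drop the oldest
    exchange (keeping a leading system message) and try again."""
    attempt = list(messages)
    for _ in range(max_retries + 1):
        try:
            return send(attempt)
        except ContextTooLongError:
            head = [m for m in attempt[:1] if m["role"] == "system"]
            body = attempt[len(head):]
            if len(body) <= 1:
                raise  # only the newest message is left; nothing to drop
            drop = min(2, len(body) - 1)  # never drop the newest message
            attempt = head + body[drop:]
    raise ContextTooLongError("still too long after trimming")
```

Dropping two messages at a time removes one user/assistant exchange per retry. Each failed attempt is still a round trip, so this works best as a fallback rather than the primary trimming mechanism.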

Do I need to resend the entire conversation history every single time?

Yes, with the standard Chat Completions or Messages APIs. Every call is independent, so you must include all the context the model needs. OpenAI's newer Responses API lets you chain calls via previous_response_id so the server retains history, but you are still paying for those tokens — the history still exists, you just do not have to transmit it yourself.

How many tokens does a typical conversation consume?

It depends on message length, but a rough estimate: each back-and-forth turn of a few sentences adds 100-300 tokens. A 20-turn conversation might accumulate 3 000-6 000 input tokens per call by the end. With GPT-4o at about $2.50 per million input tokens, that is roughly $0.0075-$0.015 per request (around one cent), which compounds quickly at scale. Monitoring token usage from the API response object (usage.prompt_tokens in OpenAI's Chat Completions; Anthropic's Messages API reports usage.input_tokens) is the most accurate way to track real consumption.

What is the best way to trim history without losing important context?

For most apps, keeping the last 10-20 messages (sliding window) is sufficient — recent context is almost always the most relevant. For longer sessions, a hybrid approach works well: keep the last 10 messages verbatim and replace everything older with a one-paragraph summary generated by a cheap model. Pure summarization loses specific details (dates, names, code snippets), so always preserve a recent verbatim window.

Should I store conversation history in my database or rely on the API?

Store it in your own database. An in-memory messages list is lost when your server restarts. Persisting history to Postgres, Redis, or any other store keyed by a conversation_id lets you resume conversations across server restarts, scale horizontally across multiple servers, inspect or edit history for debugging, and apply your own retention or compliance policies. The API itself has no memory of past calls.

Further reading