In plain English
When you ask an LLM a question, the model doesn't think up the whole answer and then hand it to you all at once. It writes one piece at a time, left to right. Streaming just means the API sends you each piece the moment it's ready, instead of waiting until the entire answer is finished.
Picture a friend reading you a recipe over the phone. The non-streaming friend reads the whole recipe silently to themselves first, then calls you back five minutes later and dictates it. The streaming friend just starts talking the second they see the first line, and keeps going as they read. Same recipe, same words — but you start hearing it immediately and you can tell they're still working.
That piece-at-a-time flow is what produces the now-familiar "typewriter" effect in ChatGPT, Claude, and almost every AI chat box you've used. The text crawls onto the screen because it is literally arriving in tiny pieces, a token or a few tokens at a time. A token is roughly a word fragment, about four characters of English, so a single sentence works out to a couple dozen tokens spread across a series of small streamed chunks.
Why it matters
LLMs are slow by web standards. A long answer can take 10–30 seconds to finish generating. If you wait for the whole thing before showing anything, the user stares at a spinner for half a minute — and many of them assume your app is broken and leave.
The single number that matters here is time to first token (TTFT) — how long until the very first chunk shows up. With streaming, TTFT is usually well under a second, even if the full answer takes 20 seconds. The user sees motion almost instantly, reads along as it generates, and the perceived speed is dramatically better even though the total time is identical.
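To make TTFT concrete, here is a minimal sketch that times any iterator of text chunks. The `fake_model_stream` generator is a hypothetical stand-in for a real model stream, and both function names are made up for illustration.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Consume text chunks and record time to first token and total time."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # the first chunk just arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return {"ttft": ttft, "total": total, "text": "".join(parts)}


def fake_model_stream(delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one chunk every `delay` seconds."""
    for piece in ["Stream", "ing ", "feels ", "fast."]:
        time.sleep(delay)
        yield piece


stats = measure_stream(fake_model_stream())
print(f"TTFT {stats['ttft']:.3f}s, total {stats['total']:.3f}s")
```

Pointing the same function at a real SDK text stream would give you the two numbers worth logging per request: how long the user waited for motion, and how long the whole answer took.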
Without streaming:
- Send request
- Blank screen + spinner
- ...wait 20 seconds...
- Entire answer pops in at once
- User wondered if it crashed

With streaming:
- Send request
- First words appear in ~0.5s
- Text flows in as it's written
- User reads along
- Feels fast and alive
Who should care? Anyone building a user-facing AI feature — a chatbot, a writing assistant, a coding tool, a support agent. Streaming is the default UX expectation now; an app that doesn't stream feels broken next to one that does. It's a core part of good AI UX patterns.
Streaming also gives you an early cancel button. If the user sees the answer going the wrong way three seconds in, they can stop it, and you stop paying for the tokens that were never generated. Without streaming, the user sees nothing until the answer is complete, so there is nothing to react to and you pay for the full response no matter what.
How it works
Under the hood, streaming rides on a web standard called Server-Sent Events (SSE). SSE is a simple, one-directional channel: the server holds the HTTP connection open and keeps pushing little text messages down it until it's done. It's been part of the HTML standard for years — LLM APIs just adopted it as the transport for token streams.
You opt in by setting "stream": true in your request. The server then responds with the content type text/event-stream and, instead of one big JSON blob, sends a sequence of small events. Each event is plain text shaped like this:

    event: content_block_delta
    data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}

    event: content_block_delta
    data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "!"}}

Notice the shape: a line starting with event: names the kind of event, a line starting with data: carries a JSON payload, and a blank line separates one event from the next. That blank-line boundary is the whole protocol. Your job is to read the connection line by line, group lines into events, parse the JSON, and pull out the new text fragment (the delta).
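If you ever do need to read the wire format yourself, the loop just described can be sketched in a few lines of Python. This is a simplified reading of SSE under two assumptions: every `data:` payload is JSON, and the input is an iterable of already-decoded lines. The helper names `parse_sse` and `text_deltas` are made up for this example.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Group raw SSE lines into events; a blank line ends each event."""
    event_name, data_lines = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # blank line: emit whatever was collected
            if data_lines:
                yield {"event": event_name, "data": json.loads("\n".join(data_lines))}
            event_name, data_lines = None, []
        elif line.startswith(":"):
            continue  # SSE comment line, carries no event data
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # Drop the single optional space after the colon.
            data_lines.append(line[len("data:"):].removeprefix(" "))
    # An event with no closing blank line is incomplete and is dropped.


def text_deltas(events: Iterable[dict]) -> Iterator[str]:
    """Pull just the new text fragments out of parsed events."""
    for ev in events:
        delta = ev["data"].get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta["text"]


raw = [
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "!"}}',
    "",
]
print("".join(text_deltas(parse_sse(raw))))  # prints "Hello!"
```

A stream that ends with a non-JSON marker would need a check before `json.loads`; this sketch leaves that out to stay close to the two-event example.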
A full response isn't just text deltas. The stream is a small life-cycle of named events. On the Anthropic Claude API, for example, you get a message_start, then a content_block_start, then a burst of content_block_delta events (the actual words), then a content_block_stop, and finally message_delta and message_stop to close it out. There are also occasional ping events that carry no content; they just keep the connection from timing out. The diagram below shows the order.
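One way to picture that life-cycle is as a small fold over the events. The sketch below assumes the events have already been parsed into dictionaries shaped like the Claude API payloads; `assemble_message` is a hypothetical helper, and the sample list includes a content_block_stop, which the API sends when a block ends.

```python
def assemble_message(events):
    """Fold already-parsed event payloads into the final text and stop reason."""
    blocks = {}  # content block index -> list of text fragments
    stop_reason = None
    finished = False
    for ev in events:
        kind = ev["type"]
        if kind == "ping":
            continue  # keep-alive only, nothing to record
        elif kind == "content_block_start":
            blocks.setdefault(ev["index"], [])
        elif kind == "content_block_delta":
            if ev["delta"]["type"] == "text_delta":
                blocks.setdefault(ev["index"], []).append(ev["delta"]["text"])
        elif kind == "message_delta":
            stop_reason = ev["delta"].get("stop_reason")
        elif kind == "message_stop":
            finished = True
        # message_start and content_block_stop need no action in this sketch
    text = "".join("".join(blocks[i]) for i in sorted(blocks))
    return {"text": text, "stop_reason": stop_reason, "finished": finished}


events = [
    {"type": "message_start", "message": {"role": "assistant"}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "ping"},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "!"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}},
    {"type": "message_stop"},
]
result = assemble_message(events)
print(result)
```

Checking `finished` before trusting the text is a cheap way to tell a complete answer from one that was cut off mid-stream.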
The other big providers follow the same SSE pattern with their own event names — OpenAI sends data: chunks ending in a literal data: [DONE] marker, and Google's Gemini API streams JSON objects the same way. The names differ; the idea is identical. And here's the good news: you almost never parse this by hand.
Consuming a stream in code
In practice you let the SDK do the heavy lifting and just iterate. Here's the backend pattern in Python — open a stream and print each text fragment as it arrives:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

    with client.messages.stream(
        model="claude-opus-4-8",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Write a haiku about streaming."}],
    ) as stream:
        for text in stream.text_stream:  # already-decoded text chunks
            print(text, end="", flush=True)  # flush => no buffering, true live output

        # The SDK accumulated everything; grab the final assembled message if you need it.
        final = stream.get_final_message()

    print("\n\nstop reason:", final.stop_reason)

Two details matter. First, text_stream hands you ready-to-use strings: the SDK already unwrapped the text_delta for you, so there's no JSON to parse. Second, flush=True forces each chunk to the terminal immediately; without it your runtime might buffer and you'd lose the live feel entirely.
But the terminal is rarely the goal. Usually your server is the middleman: it streams from the model and re-streams to a browser. The clean way is to forward the SSE through your own endpoint. Here's a minimal relay using the TypeScript SDK and the standard ReadableStream:

    import Anthropic from "@anthropic-ai/sdk";

    const client = new Anthropic(); // ANTHROPIC_API_KEY in the env

    export async function POST(req: Request) {
      const { prompt } = await req.json();

      const stream = client.messages.stream({
        model: "claude-opus-4-8",
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      });

      // Turn the model's text events into our own SSE feed for the browser.
      const body = new ReadableStream({
        async start(controller) {
          const enc = new TextEncoder();
          stream.on("text", (chunk) => {
            controller.enqueue(enc.encode(`data: ${JSON.stringify(chunk)}\n\n`));
          });
          await stream.finalMessage(); // wait until the model is done
          controller.enqueue(enc.encode("data: [DONE]\n\n"));
          controller.close();
        },
      });

      return new Response(body, {
        headers: { "Content-Type": "text/event-stream" },
      });
    }

On the browser side you read that feed. The native EventSource API only does GET requests, so for a POST chat endpoint most apps read the response body with fetch and a reader, or use a small helper library. The principle stays the same: read chunks, append text to the screen, watch for the [DONE] marker.
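The reading loop itself is tiny. Here it is sketched in Python, to match the backend example; a browser would do the equivalent with fetch and a reader. It assumes the relay format above, where each chunk is a JSON-encoded string on a `data:` line and the feed ends with `data: [DONE]`. The function name `read_relay_feed` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def read_relay_feed(lines: Iterable[str]) -> Iterator[str]:
    """Yield text chunks from the relay's feed until the [DONE] marker."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream marker; it is not JSON, so check it first
        yield json.loads(payload)  # each chunk was JSON-encoded as a string


feed = [
    'data: "Hel"',
    "",
    'data: "lo\\nworld"',  # a newline inside a chunk survives as \n in JSON
    "",
    "data: [DONE]",
    "",
    'data: "never read"',
]
print("".join(read_relay_feed(feed)))
```

The JSON encoding on the relay side is what makes this safe: a raw newline inside a chunk would otherwise be read as the end of an SSE line.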
When to stream — and when not to
Streaming is great for humans reading text in real time. It's the wrong tool when a machine needs the complete, validated answer before doing anything. Here's the quick decision table:
| Situation | Stream? | Why |
|---|---|---|
| Chatbot / writing assistant | Yes | User reads as it types; perceived speed is everything |
| Coding agent showing its work | Yes | Long outputs feel responsive; user can cancel early |
| Background job / batch summarizer | No | No human watching — just take the whole result |
| Need valid JSON before you act | Usually no | A half-streamed object isn't parseable yet |
| Using the answer as a tool input | No | Function calling needs the complete arguments first |
The JSON case is the big gotcha. If you ask the model for structured output and stream it, the partial text you've received mid-stream is invalid JSON — half a brace, a dangling quote. You can't parse it until the stream finishes. So for pure machine-to-machine calls where you only care about the final parsed object, streaming adds complexity for no benefit. Take the whole response in one shot.
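You can see the gotcha with nothing but the standard library. This small sketch splits a made-up JSON object into arbitrary 12-character chunks (real chunk boundaries will differ) and tries to parse after each one.

```python
import json

full = '{"title": "Streaming 101", "tags": ["sse", "llm"]}'
chunks = [full[i:i + 12] for i in range(0, len(full), 12)]

received = ""
parse_attempts = []  # True where the text received so far parsed cleanly
for chunk in chunks:
    received += chunk
    try:
        json.loads(received)
        parse_attempts.append(True)
    except json.JSONDecodeError:
        parse_attempts.append(False)

print(parse_attempts)  # only the final attempt succeeds
```

Every prefix fails because the closing brace is the last character to arrive, which is exactly why a consumer that only wants the parsed object gains nothing from streaming.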
There's a middle ground worth knowing: some apps stream and still want structure — for example, showing a partial UI as fields fill in. That's an advanced pattern (partial-JSON or incremental parsing), and it's the bridge into the rest of this streaming and structured outputs topic.
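To give a feel for that middle ground, here is a deliberately naive sketch of incremental parsing: close whatever strings and brackets are still open, then try a normal parse. `parse_partial_json` is a hypothetical helper, not a library API, and real implementations handle many more cases.

```python
import json


def parse_partial_json(text: str):
    """Best-effort parse of a JSON prefix; returns None if it can't be parsed yet."""
    closers = []  # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers:
            closers.pop()
    # Close an open string first, then unwind the open brackets.
    candidate = text + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # e.g. a dangling key or trailing comma; wait for more text


print(parse_partial_json('{"name": "Ad'))
print(parse_partial_json('{"name": "Ada", "ag'))
print(parse_partial_json('{"name": "Ada", "tags": ["a", "b'))
```

A UI built on this would keep the last non-None snapshot and re-render as fields fill in. Note that values in a snapshot can be truncated (a string cut short, or `3` that later becomes `36`), so treat partial results as display-only until the stream finishes.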
Going deeper
Buffering: the silent killer of streaming
The most common reason "my streaming doesn't stream" is that something in the middle is buffering — holding the chunks back and releasing them in a clump. Reverse proxies (Nginx with proxy_buffering on), serverless gateways, gzip compression layers, and CDNs can all collect your beautiful token stream and deliver it as one blob, killing the whole point. When debugging, test against the model directly first to confirm the stream works, then bisect each hop in your stack.
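When bisecting hops, it helps to have a number rather than a feeling. One rough heuristic, sketched below, is to record when each chunk arrives and check whether they all landed in a sliver of the total wait. The function name and the 0.2 threshold are assumptions chosen for illustration, not a standard.

```python
def looks_buffered(arrival_times: list[float], min_spread: float = 0.2) -> bool:
    """Guess whether a stream was held back and released in one clump.

    arrival_times holds seconds since the request was sent, one per chunk.
    """
    if len(arrival_times) < 2:
        return False  # not enough chunks to judge
    first, last = arrival_times[0], arrival_times[-1]
    if last <= 0:
        return False
    # Fraction of the total wait during which chunks were actually arriving.
    spread = (last - first) / last
    return spread < min_spread


direct = [0.4, 0.9, 1.6, 2.5, 3.1]            # chunks trickle in over time
via_proxy = [3.05, 3.06, 3.06, 3.07, 3.08]    # everything lands at the end
print(looks_buffered(direct), looks_buffered(via_proxy))
```

Run the same measurement against the model directly and then through each layer of your stack; the hop where the result flips is the one doing the buffering.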
Streaming tool calls and reasoning
Streaming isn't only for plain text. When a model decides to call a tool, the arguments stream too, as input_json_delta fragments that build up the JSON arguments piece by piece. Reasoning models stream their thinking as thinking_delta events before the visible answer. The point: a single stream can carry multiple kinds of content blocks (text, tool input, thinking), each with its own delta type and index. If you're routing these somewhere, switch on the delta type rather than assuming every chunk is display text. This matters for tool use and agent UIs that show actions live.
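Switching on the delta type can look like the sketch below. It assumes events already parsed into dictionaries with the Claude API's delta shapes (`text` on text_delta, `partial_json` on input_json_delta, `thinking` on thinking_delta); `route_deltas` is a hypothetical helper name.

```python
import json
from collections import defaultdict


def route_deltas(events):
    """Accumulate each content block's deltas by kind, keyed by block index."""
    text = defaultdict(str)
    tool_json = defaultdict(str)
    thinking = defaultdict(str)
    for ev in events:
        if ev.get("type") != "content_block_delta":
            continue
        idx, delta = ev["index"], ev["delta"]
        kind = delta["type"]
        if kind == "text_delta":
            text[idx] += delta["text"]
        elif kind == "input_json_delta":
            tool_json[idx] += delta["partial_json"]
        elif kind == "thinking_delta":
            thinking[idx] += delta["thinking"]
        # any other delta type is ignored rather than shown as text
    return dict(text), dict(tool_json), dict(thinking)


events = [
    {"type": "content_block_delta", "index": 0, "delta": {"type": "thinking_delta", "thinking": "Need weather. "}},
    {"type": "content_block_delta", "index": 1, "delta": {"type": "text_delta", "text": "Checking..."}},
    {"type": "content_block_delta", "index": 2, "delta": {"type": "input_json_delta", "partial_json": '{"city": "Par'}},
    {"type": "content_block_delta", "index": 2, "delta": {"type": "input_json_delta", "partial_json": 'is"}'}},
]
text, tool_json, thinking = route_deltas(events)
print(text, json.loads(tool_json[2]), thinking)
```

The tool arguments only become parseable once the last fragment for that block has arrived, so the `json.loads` call belongs after the block ends, not inside the loop.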
Token usage and cost
Streaming doesn't change the bill: you pay for the same input and output tokens either way, since pricing is per token regardless of delivery. The usage numbers are not in the text deltas. On the Claude API, for example, the input token count arrives in message_start and the cumulative output token count in message_delta near the end of the stream, so if you log cost per request, read it from those events, not the deltas. The one real saving is cancellation: stop a runaway generation early and you stop paying for the un-generated tokens.
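Capturing usage for a cost log might look like this sketch. It assumes Claude-style event payloads, with input_tokens reported in message_start and a cumulative output_tokens in message_delta. The per-million-token prices are placeholder numbers, not real pricing, and both function names are hypothetical.

```python
def usage_from_events(events):
    """Read token counts from the bookend events rather than the text deltas."""
    input_tokens = output_tokens = 0
    for ev in events:
        if ev["type"] == "message_start":
            input_tokens = ev["message"]["usage"]["input_tokens"]
        elif ev["type"] == "message_delta":
            output_tokens = ev["usage"]["output_tokens"]  # cumulative count
    return input_tokens, output_tokens


def cost_usd(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok):
    """Cost in dollars, given prices per million tokens."""
    return (input_tokens * in_price_per_mtok + output_tokens * out_price_per_mtok) / 1_000_000


events = [
    {"type": "message_start", "message": {"usage": {"input_tokens": 1200, "output_tokens": 1}}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "..."}},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 450}},
]
tokens_in, tokens_out = usage_from_events(events)
# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
print(tokens_in, tokens_out, cost_usd(tokens_in, tokens_out, 3.0, 15.0))
```

If a stream is cancelled before message_delta arrives, this sketch reports zero output tokens, so a production logger would need its own fallback for interrupted requests.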
Reliability: drops, retries, and idempotency
An open connection can drop mid-stream — flaky mobile networks, proxy timeouts, a server restart. The SSE spec includes an event id and a Last-Event-ID header for resuming, but LLM streams generally can't be resumed token-perfect, so the practical strategy is: buffer what you've received, and on failure either retry the whole call or show the user a clean "connection lost, retry?" rather than a frozen half-answer. Always set a timeout, and always handle the error event — a stream that silently stalls is worse than one that fails loudly.
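The retry-the-whole-call strategy can be sketched as a small wrapper. Here `open_stream` is a hypothetical callable that starts a fresh request and returns an iterator of text chunks, and `flaky_stream` is a fake that drops on its first attempt. Timeouts are left to the HTTP client and not shown.

```python
from typing import Callable, Iterable


def stream_with_retry(open_stream: Callable[[], Iterable[str]], max_attempts: int = 3) -> str:
    """Collect a full streamed answer, restarting from scratch if the connection drops."""
    last_error = None
    for _ in range(max_attempts):
        received = []  # buffer for this attempt only
        try:
            for chunk in open_stream():
                received.append(chunk)
            return "".join(received)
        except ConnectionError as err:
            # A retry is a new generation, so the partial text is discarded
            # instead of being stitched onto the next attempt.
            last_error = err
    raise RuntimeError(f"stream failed after {max_attempts} attempts") from last_error


attempts = {"count": 0}


def flaky_stream():
    """Fake stream that drops mid-answer on the first attempt only."""
    attempts["count"] += 1
    yield "Hel"
    if attempts["count"] == 1:
        raise ConnectionError("dropped mid-stream")
    yield "lo"


answer = stream_with_retry(flaky_stream)
print(answer, attempts["count"])
```

In a chat UI you would likely surface the failure to the user after the first drop rather than retrying silently, since the retried answer may differ from the half they already read.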
Backpressure and slow consumers
If your downstream (the browser, a database write, a slow client) can't keep up with how fast tokens arrive, chunks pile up in memory. Production-grade relays respect backpressure — they pause reading from the model when the consumer's buffer is full. The web ReadableStream and most SDK stream objects support this; the danger is in naive code that just pushes every chunk into an unbounded array. For high-traffic services this becomes a real LLMOps concern.
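A bounded queue is one simple way to get backpressure. In this asyncio sketch the producer stands in for the code reading from the model and the consumer for a slow client; both are simulated, and `relay` is a hypothetical name. The key line is `await queue.put(...)`, which suspends the producer whenever the queue is full.

```python
import asyncio


async def relay(chunks, maxsize: int = 4):
    """Pass chunks through a bounded queue so a slow consumer throttles the producer."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    done = object()  # sentinel marking the end of the stream
    high_water = 0   # largest queue size observed
    delivered = []

    async def producer():
        nonlocal high_water
        for chunk in chunks:
            await queue.put(chunk)  # suspends here while the queue is full
            high_water = max(high_water, queue.qsize())
        await queue.put(done)

    async def consumer():
        while True:
            item = await queue.get()
            if item is done:
                return
            await asyncio.sleep(0.001)  # simulate a slow client
            delivered.append(item)

    await asyncio.gather(producer(), consumer())
    return delivered, high_water


delivered, high_water = asyncio.run(relay([f"tok{i} " for i in range(50)]))
print(len(delivered), high_water)
```

Memory use stays capped at `maxsize` chunks no matter how long the answer is, which is the property the unbounded-array version lacks.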
FAQ
What is streaming in an LLM API?
It's an option where the API sends the model's answer in small pieces (tokens) as they're generated, instead of waiting for the full reply. You enable it with "stream": true, and the text arrives over server-sent events so your app can display it live with that typewriter effect.
How do I stream a ChatGPT or Claude response?
Set the stream flag in your request and loop over the events the API returns. The easiest path is the official SDK: in Python you iterate stream.text_stream, in TypeScript you listen for the text event. The SDK parses the server-sent events for you, so you just print or forward each chunk.
What are server-sent events (SSE)?
SSE is a web standard for a server to push a stream of text messages to a client over one open HTTP connection. Each message is a block of event: and data: lines separated by a blank line. LLM APIs use SSE as the transport for token-by-token streaming.
Does streaming make the LLM faster or cheaper?
Neither, really. Total generation time and token cost are the same as a non-streamed call — pricing is per token regardless. Streaming improves perceived speed by showing the first words in well under a second, and it lets users cancel early, which is the only way it saves money.
When should I not use streaming?
Skip it when no human is watching the output as it generates — background jobs, batch processing — or when you need a complete, valid result before acting. Mid-stream JSON is unparseable, so if you only want the final structured object or tool arguments, just take the whole response at once.
Why is my streaming response arriving all at once instead of gradually?
Something in your stack is buffering. Common culprits are a reverse proxy (e.g. Nginx proxy_buffering), gzip compression, a serverless gateway, or a CDN holding chunks back. Confirm the stream works against the model directly, then check each hop, and make sure you flush output instead of letting your runtime buffer it.