In plain English
An LLM generating a response is not like a database returning a row. A database query takes milliseconds. An LLM generating 400 tokens at a typical hosted-API speed takes 3–10 seconds, and a long reasoning response or document summary can run 30 seconds or more. That is not a bug — it is how the technology works. The model reads every token it has generated so far before deciding what to write next, so generation time grows with output length.
Designing for LLM latency means making those seconds feel tolerable — or even reassuring — rather than broken. The three main tools are streaming (show each word as it is generated), loading skeletons (show a placeholder shaped like the expected answer while the first token is on its way), and a stop button (let users cancel a generation they no longer want). Together they transform a 7-second white-screen freeze into an experience that feels roughly as fast as watching someone type.
Think of it like watching a Polaroid develop. If someone handed you a blank white square and silently walked away, you'd assume something broke. But if you can see the image slowly appearing, the same 60-second wait feels intentional and alive. Streaming text is the LLM equivalent of that developing photograph.
Why it matters
Latency is one of the most common reasons users abandon an AI feature after the first try. Research on human-computer interaction consistently finds that under 100 ms feels instant, under 1 second keeps the user's flow of thought intact, and over 10 seconds causes most users to lose attention or assume the page is broken. LLMs routinely take 3–10 seconds to deliver a complete response: well past the 1-second limit, and close enough to the 10-second one that users start refreshing or switching tabs.
There are three practical costs when you ignore latency design:
- Abandoned sessions. Users who see a spinner for 5+ seconds with no feedback frequently refresh the page, which cancels the request entirely, wastes the compute you already paid for, and leaves the user thinking the product is broken.
- Wasted API spend. Without a stop button, a user who gets the answer they need in the first paragraph still waits (and you still pay) for the remaining 1,000 tokens the model is generating. On high-traffic apps this is measurable money.
- Trust erosion. Silent waiting with no progress signal makes users less confident in the answer that finally arrives, not more. A skeleton that shows the response structure materialising is psychologically reassuring even if the actual text isn't there yet.
The flip side: when latency is designed well, it stops being a liability and becomes a feature. Watching a ChatGPT or Claude response stream in feels like watching a knowledgeable colleague think out loud. Users who see text appearing immediately rate the product as faster even when total generation time is identical to a batch response. Perceived speed and actual speed are different quantities, and you can move perceived speed without changing a single line of model code.
How it works
Three concepts underpin all LLM latency UX: Time to First Token (TTFT), inter-token latency, and the streaming protocol. Understanding them tells you which UX pattern to reach for in which situation.
Time to First Token (TTFT)
TTFT is the gap between submitting a request and receiving the very first output token. During this window the model is reading your prompt, running attention over the full context, and building the key-value cache (the prefill phase). Depending on prompt length, model size, and server load, TTFT typically ranges from 200 ms on a fast dedicated deployment up to 2–4 seconds on a busy shared API. This is the window where a skeleton or a status message matters most, because the user sees nothing happening unless you put something there deliberately.
Inter-token latency
Once the first token arrives, subsequent tokens follow at a rate determined by GPU throughput — typically 20–80 tokens per second on major hosted APIs, which translates to roughly 15–60 words per second. At 40 tokens/s a 400-token reply fully streams in about 10 seconds, but the user is reading the whole time rather than waiting. This is why streaming transforms perceived latency so dramatically.
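The arithmetic above can be captured in a few lines. This is a rough sketch, not a benchmark: the function name is hypothetical, the 0.5-second TTFT is an assumed value, and real token rates fluctuate during a response.

```typescript
interface LatencyEstimate {
  firstTextSeconds: number; // when the user first sees any output
  totalSeconds: number;     // when the full reply has arrived
}

// Rough model: total time = TTFT + outputTokens / tokensPerSecond.
function estimateLatency(
  ttftSeconds: number,
  outputTokens: number,
  tokensPerSecond: number,
  streaming: boolean,
): LatencyEstimate {
  const totalSeconds = ttftSeconds + outputTokens / tokensPerSecond;
  return {
    // With streaming the blank wait ends at the first token; without it, at the last.
    firstTextSeconds: streaming ? ttftSeconds : totalSeconds,
    totalSeconds,
  };
}

const streamed = estimateLatency(0.5, 400, 40, true);
const batch = estimateLatency(0.5, 400, 40, false);
console.log(streamed); // { firstTextSeconds: 0.5, totalSeconds: 10.5 }
console.log(batch);    // { firstTextSeconds: 10.5, totalSeconds: 10.5 }
```

Both calls finish at the same moment; only the length of the blank wait differs, which is the whole case for streaming.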
The streaming protocol: Server-Sent Events
Most hosted LLM APIs, including OpenAI's, Anthropic's, and Google's, deliver streamed tokens over Server-Sent Events (SSE), and toolkits such as the Vercel AI SDK wrap those streams for the frontend. SSE is a one-way HTTP channel: the server pushes small chunks of text whenever a token is ready, and the browser receives them incrementally. Each event is a data: line, usually carrying a small JSON payload, whose text your frontend appends to the display. How the stream ends depends on the provider: OpenAI's chat completions stream sends a data: [DONE] sentinel, while Anthropic's sends a message_stop event.
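The data: framing described above can be parsed with a small buffering function. The sketch below is a simplification under stated assumptions: it only handles "\n" line endings, treats every data: line as its own payload (the SSE specification joins multiple data: lines within one event), and recognises the OpenAI-style [DONE] sentinel. The `{"text": ...}` payload shape in the usage lines is hypothetical.

```typescript
interface SseParseResult {
  payloads: string[]; // contents of complete `data:` lines
  done: boolean;      // true once the [DONE] sentinel was seen
  rest: string;       // incomplete trailing text to carry into the next call
}

function parseSse(buffer: string): SseParseResult {
  const payloads: string[] = [];
  let done = false;
  // Events are separated by a blank line; the last piece may be incomplete.
  const events = buffer.split('\n\n');
  const rest = events.pop() ?? '';
  for (const event of events) {
    for (const line of event.split('\n')) {
      if (!line.startsWith('data:')) continue; // skip `event:`, `id:`, comments
      const data = line.slice(5).trimStart();
      if (data === '[DONE]') done = true;
      else payloads.push(data);
    }
  }
  return { payloads, done, rest };
}

// A network chunk can end mid-event, so the leftover is fed into the next call.
const first = parseSse('data: {"text":"Hel"}\n\ndata: {"te');
const second = parseSse(first.rest + 'xt":"lo"}\n\ndata: [DONE]\n\n');
console.log(first.payloads, second.payloads, second.done);
```

A parser like this sits between the raw stream reader and the UI: decoded text goes in, complete payloads come out, and the `rest` string is prepended to whatever arrives next.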
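```typescript
// Assumes `messages`, `appendToUI`, and `stopButton` exist in the surrounding
// code, and that /api/chat streams plain text rather than raw SSE events.
const controller = new AbortController();

// Wire the Stop button before the request starts so it works mid-stream.
stopButton.onclick = () => controller.abort();

try {
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
    signal: controller.signal, // aborting cancels the request and the read loop
  });
  if (!response.ok || !response.body) {
    throw new Error(`Chat request failed with status ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    appendToUI(decoder.decode(value, { stream: true })); // show text as it arrives
  }
} catch (err) {
  // abort() rejects the pending fetch or read with an AbortError: a normal Stop
  if ((err as Error).name !== 'AbortError') throw err;
}
```

The right loading state for each phase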
Not every part of an LLM response lifecycle looks the same, and the right visual treatment differs by phase. Using a spinner everywhere is lazy and unhelpful; using a skeleton in the wrong moment creates false expectations. Here is how to think about it.
| Phase | What the user sees | Best loading pattern |
|---|---|---|
| TTFT window (0 → first token) | Nothing has arrived yet | Animated skeleton shaped like the expected output |
| Streaming in progress | Tokens arriving steadily | Live text append with a blinking cursor at the tail |
| Tool call / retrieval mid-stream | Model paused while fetching data | Inline status line: "Searching…" or "Reading document…" |
| Error or timeout | Generation stopped unexpectedly | Error message + Retry button; keep partial text visible |
| User cancelled (Stop) | User clicked Stop | Preserve partial text; re-enable input immediately |
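One way to keep the phases in the table from drifting apart in code is a small state machine that owns the current phase and maps it to a loading pattern. The sketch below is only an illustration: the phase names, event names, and pattern labels are hypothetical, and a real component would render UI rather than return strings.

```typescript
type Phase = 'idle' | 'waiting' | 'streaming' | 'tool' | 'error' | 'cancelled';
type StreamEvent = 'submit' | 'token' | 'toolStart' | 'finish' | 'fail' | 'stop';

function nextPhase(phase: Phase, event: StreamEvent): Phase {
  const active = phase === 'waiting' || phase === 'streaming' || phase === 'tool';
  switch (event) {
    case 'submit':    return active ? phase : 'waiting';   // ignore double submits
    case 'token':     return active ? 'streaming' : phase; // drop late tokens after Stop
    case 'toolStart': return active ? 'tool' : phase;
    case 'finish':    return active ? 'idle' : phase;
    case 'fail':      return active ? 'error' : phase;
    case 'stop':      return active ? 'cancelled' : phase;
  }
}

// One loading pattern per phase, mirroring the table rows.
const loadingPattern: Record<Phase, string> = {
  idle: 'none',
  waiting: 'animated skeleton',
  streaming: 'live text with cursor',
  tool: 'inline status line',
  error: 'error message + Retry, partial text kept',
  cancelled: 'partial text kept, input re-enabled',
};

// Input is locked only while a generation is in flight.
function inputEnabled(phase: Phase): boolean {
  return phase === 'idle' || phase === 'error' || phase === 'cancelled';
}

let phase: Phase = 'idle';
for (const event of ['submit', 'token', 'toolStart', 'token', 'stop'] as StreamEvent[]) {
  phase = nextPhase(phase, event);
  console.log(event, '->', phase, '|', loadingPattern[phase]);
}
```

Keeping the transitions in one function means a late token that arrives after Stop cannot silently flip the UI back into the streaming state.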
Skeletons: why they beat spinners for AI
A spinner says something is happening. A skeleton says here is roughly what is about to appear. For AI responses, skeletons are substantially better because they set spatial expectations — the user's eye is already positioned where the text will land, so when tokens arrive there is no jarring layout shift. Research on web UX finds that skeleton screens feel 20% faster than spinners for identical wait times, even though both are purely cosmetic. The pulse or shimmer animation on a skeleton also signals ongoing progress rather than static blocking — 300–700 ms cycles work best.
For a chat interface, a good skeleton is two or three grey rounded lines of varying width sitting in the assistant bubble — close enough to the shape of a real reply that the transition from placeholder to text feels smooth rather than jumpy. For a document-generation feature, the skeleton might be a full-page layout of grey lines. The key rule: the skeleton should look like the answer you expect, not a generic loader.
Status lines for agentic flows
When an LLM is part of a multi-step agent — calling tools, running searches, reading files — there are often 5–20 seconds of non-streaming work between the user's message and the first output token. A skeleton alone is not enough; you need a status line that advances with each step. This is what you see in Perplexity's "Searching the web…", Claude's "Reading document…", or OpenAI's "Running code…" animations. Each line change tells the user the system is making progress and gives a hint about what kind of answer is coming.
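A status line that advances with each step can be as simple as a mapping from agent events to short strings. This sketch assumes a hypothetical `AgentStep` event shape; your orchestration framework will emit its own event types, and the wording is only an example.

```typescript
type AgentStep =
  | { kind: 'search'; query: string }
  | { kind: 'readDocuments'; count: number }
  | { kind: 'runCode' };

// Translate an internal step into the short line shown to the user.
function statusLine(step: AgentStep): string {
  switch (step.kind) {
    case 'search':
      return 'Searching the web…';
    case 'readDocuments':
      return step.count === 1 ? 'Reading document…' : `Reading ${step.count} documents…`;
    case 'runCode':
      return 'Running code…';
  }
}

const steps: AgentStep[] = [
  { kind: 'search', query: 'latest release notes' },
  { kind: 'readDocuments', count: 3 },
  { kind: 'runCode' },
];
for (const step of steps) console.log(statusLine(step));
```

Keeping the user-facing strings in one place also makes it easy to keep them short and free of internal tool names.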
Perceived-speed tricks that cost nothing
Beyond the core patterns, there is a set of low-effort techniques that measurably improve how fast an AI feature feels without any model changes:
Show the user's message immediately (optimistic UI)
The moment the user presses Send, add their message to the conversation thread and show the assistant skeleton below it — before any network request has returned. This makes the round-trip latency feel like model latency only, not model latency plus network time. Users perceive the interface as responding to their action immediately.
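In state terms, the optimistic update is two messages added synchronously before the request is sent. The sketch below assumes a hypothetical `ChatMessage` shape with a `pending` flag that the view uses to decide whether to draw the skeleton; it is framework-agnostic and only shows the state transitions.

```typescript
interface ChatMessage {
  id: string;
  role: 'user' | 'assistant';
  text: string;
  pending: boolean; // true while the skeleton should be shown
}

// Called synchronously on Send, before any network request is made.
function addOptimisticTurn(
  thread: ChatMessage[],
  userText: string,
  newId: () => string,
): ChatMessage[] {
  return [
    ...thread,
    { id: newId(), role: 'user', text: userText, pending: false },
    { id: newId(), role: 'assistant', text: '', pending: true },
  ];
}

// Called for each streamed chunk; the first one replaces the skeleton.
function appendToken(thread: ChatMessage[], assistantId: string, token: string): ChatMessage[] {
  return thread.map((m) =>
    m.id === assistantId ? { ...m, text: m.text + token, pending: false } : m,
  );
}

let counter = 0;
const nextId = () => `msg-${++counter}`;
let thread = addOptimisticTurn([], 'Summarise this report', nextId);
thread = appendToken(thread, 'msg-2', 'The report');
console.log(thread);
```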
Use a typing cursor, not a blinking box
A blinking text cursor at the end of the streaming text is a small detail with a disproportionate impact. It signals that the system is still writing, in a way that matches how people intuitively read someone typing. Without it, users often mistake a mid-stream pause (a busy server, a network hiccup, or a tool call) for a completed or crashed response.
Set expectations with a status word
For operations you know will be slow — a long document summary, a multi-step agent task — a single line of text before the skeleton pays dividends. "Analysing your document..." or "Searching for recent news..." reframes the wait from broken to working on it. Studies on conversational agents find that even a brief filler phrase reduces perceived delay and improves satisfaction ratings, independent of actual latency.
Avoid layout shifts when the response arrives
If the skeleton does not match the shape of the real response, the UI lurches when the transition happens. Reserve space in your layout for the assistant bubble before content arrives — either with a min-height on the container or a skeleton that approximates the expected response length. For known-length outputs (a one-line classification, a yes/no) use a single-line skeleton rather than the default multi-paragraph placeholder.
Without latency design:
- Blank screen for 4–8 seconds
- No way to cancel generation
- Spinner that looks like a broken page
- Page jumps when answer appears
- Input stays blocked until done
- Users refresh and lose context

With latency design:
- Skeleton appears in < 100 ms
- Stop button cancels generation
- Status line shows what model is doing
- Smooth skeleton-to-text transition
- Input blocked but Stop is always reachable
- Partial text preserved on cancel
Going deeper
Once you have mastered the baseline patterns above, there are more sophisticated techniques for squeezing additional perceived — and actual — performance out of an LLM application.
Semantic caching
A semantic cache stores the vector embedding of previous prompts alongside their responses. When a new prompt is sufficiently similar (above a cosine-similarity threshold), the cached response is returned without hitting the model at all; the only remaining cost is embedding the new prompt and running the similarity lookup. For apps where many users ask the same class of question, such as a customer support bot or a product FAQ assistant, hit rates of 30–60% are achievable, which removes model latency entirely for those requests. Redis, Upstash, and GPTCache are commonly used for this pattern.
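The lookup itself is a nearest-neighbour search with a cut-off. Here is a minimal in-memory sketch, assuming embeddings have already been computed elsewhere; the class name, the 0.92 threshold, and the toy two-dimensional vectors are illustrative assumptions. A production cache would use a vector index rather than a linear scan.

```typescript
interface CacheEntry {
  embedding: number[];
  response: string;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;

  constructor(threshold = 0.92) {
    this.threshold = threshold;
  }

  set(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }

  // Returns the most similar cached response at or above the threshold.
  get(embedding: number[]): string | undefined {
    let best: CacheEntry | undefined;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(entry.embedding, embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best?.response;
  }
}

const cache = new SemanticCache();
cache.set([1, 0], 'You can reset your password from the login page.');
console.log(cache.get([0.99, 0.05])); // near-duplicate prompt: cache hit
console.log(cache.get([0, 1]));       // unrelated prompt: undefined, call the model
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, too high and the hit rate collapses.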
Prompt caching (provider-level)
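Both Anthropic and OpenAI offer prompt caching: if you send the same large system prompt or document prefix repeatedly, the key-value cache from the first call is reused on subsequent calls. The providers report TTFT reductions of up to roughly 80–85% on long prompts and discounts of up to 90% on cached input tokens, though the exact figures vary by provider and model. OpenAI applies caching automatically to sufficiently long prompts, while Anthropic asks you to mark the cacheable prefix with cache_control. For apps with a fixed system prompt or a large static context (a 50-page manual, a codebase), this is one of the highest-leverage latency improvements available with zero frontend work.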
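Because caching matches on a shared prefix, the practical rule is to put the static material first and the varying material last. The sketch below builds a request body modelled on the `cache_control` marker in Anthropic's Messages API; treat the field layout as an assumption to verify against the current documentation. The function name and the `max_tokens` value are illustrative, and the model name is passed in rather than guessed.

```typescript
interface TextBlock {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
}

// Static context goes first and is marked cacheable; the question varies per call.
function buildCachedRequest(model: string, staticContext: string, userQuestion: string) {
  const system: TextBlock[] = [
    { type: 'text', text: staticContext, cache_control: { type: 'ephemeral' } },
  ];
  return {
    model,
    max_tokens: 1024, // illustrative value
    system,
    messages: [{ role: 'user' as const, content: userQuestion }],
  };
}

const manual = 'Full text of the 50-page product manual goes here.';
const body = buildCachedRequest('your-model-id', manual, 'How do I reset the device?');
console.log(JSON.stringify(body, null, 2));
```

Anything that changes between calls, such as a timestamp or a user name, should stay out of the cached block, otherwise every request produces a new prefix and nothing is reused.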
Speculative decoding and quantization
When self-hosting a model, speculative decoding uses a small fast draft model to predict several tokens ahead, then verifies them in parallel with the main model — achieving 2–3x throughput gains with no quality loss. Quantization (running the model in INT8 or INT4 instead of FP16) cuts memory bandwidth requirements and lifts tokens-per-second by 1.5–2x. Neither technique is something you configure in a hosted API call, but understanding them helps you evaluate provider benchmarks and choose between deployments.
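The draft-then-verify loop is easier to see in a toy. The sketch below is a greedy simplification, not a production algorithm: both "models" are plain functions that return the next token, real systems verify all draft positions in a single batched forward pass and use a probabilistic accept/reject rule so that sampling behaviour is preserved, and all names here are hypothetical.

```typescript
type NextToken = (context: string[]) => string;

function speculativeDecode(
  target: NextToken,
  draft: NextToken,
  prompt: string[],
  maxNewTokens: number,
  k: number, // draft tokens proposed per round, must be >= 1
): { tokens: string[]; targetPasses: number } {
  const out = [...prompt];
  let targetPasses = 0;
  while (out.length - prompt.length < maxNewTokens) {
    // 1. The cheap draft model proposes k tokens, one after another.
    const proposal: string[] = [];
    for (let i = 0; i < k; i++) proposal.push(draft([...out, ...proposal]));

    // 2. The target model checks every proposed position. On real hardware
    //    this is one batched pass, which is where the speed-up comes from.
    targetPasses++;
    let accepted = 0;
    let correction: string | undefined;
    for (let i = 0; i < k; i++) {
      const expected = target([...out, ...proposal.slice(0, i)]);
      if (expected === proposal[i]) {
        accepted++;
      } else {
        correction = expected; // 3. First mismatch: keep the target's token.
        break;
      }
    }
    out.push(...proposal.slice(0, accepted));
    if (correction !== undefined) out.push(correction);
  }
  return { tokens: out.slice(prompt.length, prompt.length + maxNewTokens), targetPasses };
}

// Toy models: the target spells out the alphabet; the draft is wrong on every 4th token.
const targetModel: NextToken = (ctx) => 'abcdefgh'.charAt(ctx.length % 8);
const draftModel: NextToken = (ctx) => (ctx.length % 4 === 3 ? '?' : targetModel(ctx));

const result = speculativeDecode(targetModel, draftModel, [], 8, 4);
console.log(result.tokens.join(''), result.targetPasses); // abcdefgh 2
```

The output is identical to what the target model would have produced alone, but it took 2 verification rounds instead of 8 sequential steps. The gain depends entirely on how often the draft model guesses right.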
Parallelising agent steps
Multi-step agentic workflows often run steps that are independent of each other serially by default. If your agent needs to search three different sources before drafting an answer, fetching them in parallel cuts the wall-clock time of that step by roughly two-thirds when the three fetches take similar time, because the wait drops from the sum of the three durations to the longest one. Most orchestration frameworks (LangGraph, the Vercel AI SDK's tool parallelism) support parallel tool calls natively; you just need to structure your workflow to request them at the same time rather than in sequence.
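The difference between serial and parallel steps is small in code. The sketch below uses plain promises rather than any particular framework's API; the helper names and the 2-second durations are illustrative assumptions.

```typescript
type Task<T> = () => Promise<T>;

// Serial: each task starts only after the previous one has finished.
async function runSequential<T>(tasks: Task<T>[]): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) results.push(await task());
  return results;
}

// Parallel: all tasks start immediately; results keep the input order.
async function runParallel<T>(tasks: Task<T>[]): Promise<T[]> {
  return Promise.all(tasks.map((task) => task()));
}

// Back-of-envelope wall-clock time for independent steps.
function wallClockSeconds(durations: number[], parallel: boolean): number {
  return parallel
    ? Math.max(0, ...durations)             // bounded by the slowest step
    : durations.reduce((a, b) => a + b, 0); // sum of all steps
}

const serial = wallClockSeconds([2, 2, 2], false);
const concurrent = wallClockSeconds([2, 2, 2], true);
console.log(serial, concurrent, Math.round((1 - concurrent / serial) * 100) + '%'); // 6 2 67%
```

`Promise.all` rejects as soon as any task fails. If a partial set of sources is still useful for drafting an answer, `Promise.allSettled` lets the agent continue with whatever succeeded.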
Resumable streams
For long-running generations (code generation, document drafts), a network drop mid-stream means starting over. Resumable streams persist generation state server-side and allow the client to reconnect and continue from where it left off. The Vercel AI SDK's resumeStream feature supports this pattern; raw provider streaming APIs generally do not resume a dropped connection for you, so if you call one directly, expect to persist the chunks yourself and check the provider's documentation. Either way the pattern requires careful state management: you must choose between resumability and instant abort (an abort on a resumable stream is treated as a disconnect, not a cancellation).
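The core of the pattern is a server-side buffer that outlives any single connection, plus a cursor the client sends when it reconnects. The sketch below is a generic in-memory illustration, not the Vercel AI SDK's implementation; the class and method names are hypothetical, and a real deployment would keep the chunks in a shared store so any server instance can serve the reconnect.

```typescript
interface ReadResult {
  chunks: string[];  // everything the client has not seen yet
  cursor: number;    // value to send on the next reconnect
  finished: boolean; // true once generation has completed
}

class ResumableBuffer {
  private chunks: string[] = [];
  private finished = false;

  // Called by the generation loop, regardless of whether a client is connected.
  append(chunk: string): void {
    if (!this.finished) this.chunks.push(chunk);
  }

  finish(): void {
    this.finished = true;
  }

  // Called on first connect (cursor 0) and on every reconnect.
  readFrom(cursor: number): ReadResult {
    const start = Math.max(0, Math.min(cursor, this.chunks.length));
    return {
      chunks: this.chunks.slice(start),
      cursor: this.chunks.length,
      finished: this.finished,
    };
  }
}

const buffer = new ResumableBuffer();
buffer.append('function ');
buffer.append('add(a, b) ');
const beforeDrop = buffer.readFrom(0); // client receives two chunks, then the network drops
buffer.append('{ return a + b; }');    // generation continues server-side
buffer.finish();
const afterReconnect = buffer.readFrom(beforeDrop.cursor);
console.log(afterReconnect); // only the chunk the client missed, with finished: true
```

Note what this design implies for the Stop button: closing the connection no longer stops generation, so cancellation needs its own explicit signal to the server.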
FAQ
Does streaming actually make my app faster, or just feel faster?
Streaming does not change total generation time at all: the model still generates the same number of tokens at the same speed. What changes is perceived speed. Users see the first token as soon as the TTFT has elapsed (often well under a second on a fast deployment) rather than waiting for the full response. Research consistently shows users rate streamed interfaces as significantly faster even when total time is identical to a batch response.
When should I use a skeleton instead of a spinner?
Use a skeleton whenever you know the rough shape of the response that is coming — a chat bubble, a card layout, a list. A skeleton sets spatial expectations so the transition from loading to real content is smooth. Use a spinner (or nothing) only when you have no idea what shape the response will take, or when the wait is under ~300 ms and a skeleton would flash too briefly to be useful.
How do I make the stop button actually cancel the API call?
You need an AbortController whose .signal is passed to the fetch() call on the client, and whose abort event is forwarded to the upstream LLM API call on the server. Simply hiding the UI without cancelling the HTTP connection keeps the server generating tokens and keeps the billing meter running. Most frameworks like the Vercel AI SDK handle this automatically via their stop() helper, but you should verify the upstream request is truly cancelled.
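On the server, forwarding the cancellation usually means passing the incoming request's signal to the upstream call. The sketch below uses only the standard `Request`, `Response`, and `fetch` types; the upstream URL is a placeholder, authentication headers are omitted, and whether `req.signal` fires when the client disconnects depends on your server runtime, so treat that as an assumption to verify.

```typescript
type FetchLike = (url: string, init: RequestInit) => Promise<Response>;

// Hypothetical upstream endpoint; replace with your provider's URL.
const UPSTREAM_URL = 'https://llm.example.com/v1/chat';

async function handleChat(req: Request, upstreamFetch: FetchLike = fetch): Promise<Response> {
  const upstream = await upstreamFetch(UPSTREAM_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: await req.text(),
    // Key line: when the client aborts, the upstream request is aborted too,
    // so the provider stops generating (and billing) for this response.
    signal: req.signal,
  });
  // Pass the upstream stream straight through to the browser.
  return new Response(upstream.body, {
    status: upstream.status,
    headers: { 'Content-Type': 'text/event-stream' },
  });
}
```

The optional `upstreamFetch` parameter exists only to make the handler testable with a fake. To verify cancellation in practice, click Stop mid-response and confirm in your provider's usage logs that the output token count matches what was displayed.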
What should I show when an LLM tool call takes a long time mid-stream?
Show an inline status line that updates as each tool executes: "Searching the web...", "Reading 3 documents...", "Running calculation...". These status updates tell the user the system is making progress and prime their expectations for the type of answer coming. A skeleton with no status change during a 15-second tool call feels frozen — the status line is what differentiates a thinking system from a broken one.
Should I block the input field during streaming?
Yes. Allowing a new submission while the current stream is in progress creates race conditions: responses can arrive out of order, the state machine gets complicated, and the UI can become incoherent. Block the submit button and text input during streaming, but keep the Stop button prominent and active at all times. Re-enable input the instant the stream ends or is cancelled.
What is TTFT and why does it matter more than tokens-per-second?
TTFT (Time to First Token) is the delay between sending a request and seeing the first character of output. It is the dead-silent period that users experience as the model "thinking". Tokens-per-second determines how fast text flows once it starts. TTFT dominates perceived responsiveness: a model with 300 ms TTFT and 40 tokens/s will feel far faster than one with 4 s TTFT and 80 tokens/s, even though the second model has higher raw throughput. For a 400-token reply the second model actually finishes first (about 9 seconds versus about 10.3), but the user spends the first 4 of those seconds looking at nothing.