In plain English
Building a chatbot that actually works is only half the job. The other half is the interface: all the small decisions that make a user feel heard, informed, and in control — or leave them confused, anxious, and gone. Chatbot UX patterns are the recurring design solutions that the best AI chat products have converged on. They cover the moments users experience most: waiting for a reply, reading a long answer, hitting an error, not knowing whether to believe something, wanting to copy a snippet, needing to try again.
This article is the practitioner's counterpart to What Makes Good AI UX. That article explains why each principle matters. This one goes one level deeper: how to implement each pattern concretely, with the specific UI components, states, and code shapes that make them work. It's aimed at developers building chat features — not designers writing style guides.
Why these patterns matter
Most chat interfaces fail at the same predictable points. Users close the tab when the page sits blank for three seconds. They copy the whole response into a note because there's no copy button. They retype the same question with a small tweak because there's no regenerate. They paste a citation somewhere only to find the URL is made up. These failures aren't model problems — they're interface problems that no amount of prompt engineering will fix.
Each pattern below maps directly to a known user frustration. Shipping even a subset of them measurably changes whether people trust the product and come back.
| Pattern | User frustration it solves | Measurable signal |
|---|---|---|
| Streaming | "Is it frozen?" after 3 silent seconds | Perceived response speed; abandonment on slow connections |
| Typing indicator | Uncertainty while the model "thinks" before streaming begins | Reduced rage-clicks on the submit button |
| Error states | Blank screens, broken UI after a failed API call | Fewer full-page reloads; more retries via the retry button |
| Citations | "Should I believe this?" — unverifiable claims | Click-through to sources; reduced hallucination complaints |
| Copy button | Selecting text in a chat bubble is clumsy | Reduced "I had to copy it manually" feedback |
| Regenerate | "The answer was wrong but I don't want to retype" | Fewer conversation abandonments after a bad first reply |
| Conversation history | "I can't find that thing we talked about last week" | Return visit rate; session depth |
How the patterns fit together
The seven patterns cover three distinct phases of any chat interaction: before a reply arrives, while it streams, and after it lands. Thinking in phases makes it easier to see which pattern belongs where and to prioritise what to ship first.
Pattern 1: Streaming responses
LLM streaming sends tokens to the browser as they are generated rather than buffering the full response. The API delivers a stream of server-sent events (SSE) or newline-delimited JSON chunks; the frontend appends each chunk to the message bubble immediately. Total latency is unchanged. Perceived latency drops dramatically because the user sees progress within the first second.
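As a rough sketch of the chunk-handling side, assume the newline-delimited JSON variant and a hypothetical payload shape of `{"delta":"..."}` per line (your API's field names will likely differ). Network chunks do not respect line boundaries, so the parser below holds back any trailing partial line until the rest of it arrives.

```typescript
// Hypothetical wire format: one JSON object per line, e.g. {"delta":"Hel"}
interface StreamDelta {
  delta: string;
}

// Returns a feed() function that accepts decoded text chunks and returns the
// complete deltas they contain, keeping any partial trailing line buffered.
function createNdjsonParser(): (chunk: string) => StreamDelta[] {
  let buffer = "";
  return (chunk: string): StreamDelta[] => {
    buffer += chunk;
    const lines = buffer.split("\n");
    // The last element is "" (chunk ended on a newline) or a partial line.
    buffer = lines.pop() ?? "";
    const deltas: StreamDelta[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue; // skip blank lines
      deltas.push(JSON.parse(trimmed) as StreamDelta);
    }
    return deltas;
  };
}

// Usage: the second JSON object is split across two chunks.
const feed = createNdjsonParser();
let reply = "";
for (const chunk of ['{"delta":"Hel"}\n{"del', 'ta":"lo"}\n']) {
  for (const { delta } of feed(chunk)) reply += delta;
}
// reply is now "Hello"
```

The sketch does not handle malformed JSON; in practice you would probably wrap `JSON.parse` in a try/catch and surface the failure as a stream error.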
Three implementation details that trip people up: First, markdown can arrive incomplete — a code fence or bold marker may span two chunks. Buffer the incoming string and re-render on each chunk rather than trying to render each chunk in isolation. Second, auto-scroll: scroll the chat container to the bottom on each chunk so the latest text stays visible without requiring the user to scroll manually. Third, a Stop button: bind a cancel function to the stream reader so the user can interrupt a runaway reply.
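// Stream tokens from your API route and append each chunk to the UI.
async function streamReply(question: string, setReply: (s: string) => void) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });
  // Fail loudly on an HTTP error or a missing body instead of reading a bad stream.
  if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let accumulated = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // { stream: true } keeps multi-byte characters intact across chunk boundaries.
    accumulated += decoder.decode(value, { stream: true });
    setReply(accumulated); // re-render the whole accumulated string each chunk
  }
}
Pattern 2: Typing indicators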
A typing indicator — typically three animated dots — fills the gap between when the user submits and when the first token appears. That gap is called time to first token (TTFT) and can be 1–4 seconds even with a fast model, because the server has to receive the request, build the full prompt with conversation history, and wait for the model to start generating. Without a visual cue, a two-second pause reads as a crash.
The implementation is simple: show the indicator when the request is in flight and the reply bubble is empty; hide it the moment the first token arrives. Many apps combine this with a status label — "Thinking…", "Searching…", "Generating…" — when the model is doing multi-step work like tool calls or retrieval.
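One way to keep the show/hide rule and the status label in one place is a small state machine. The sketch below is illustrative only: the phase names, event names, and the `Activity` values are assumptions, not part of any particular framework or API.

```typescript
// Phases of a single assistant reply.
type ReplyPhase = "idle" | "waiting" | "streaming" | "done";

type ReplyEvent = "submit" | "first_token" | "stream_end";

// Pure transition function: unexpected events leave the phase unchanged.
function nextPhase(phase: ReplyPhase, event: ReplyEvent): ReplyPhase {
  if (event === "submit") return "waiting";
  if (event === "first_token" && phase === "waiting") return "streaming";
  if (event === "stream_end" && phase !== "idle") return "done";
  return phase;
}

// Hypothetical activity names a backend might report during multi-step work.
type Activity = "thinking" | "searching" | "generating";

const STATUS_LABELS: Record<Activity, string> = {
  thinking: "Thinking…",
  searching: "Searching…",
  generating: "Generating…",
};

// The indicator is shown only between submit and the first token.
// Returns the label to display, or null when the indicator should be hidden.
function indicatorLabel(
  phase: ReplyPhase,
  activity: Activity = "thinking",
): string | null {
  return phase === "waiting" ? STATUS_LABELS[activity] : null;
}
```

In terms of the two booleans used in this section, `isWaiting` corresponds to the `"waiting"` phase and `isStreaming` to the `"streaming"` phase; a single phase value just makes it impossible for both to be true at once.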
// Show the indicator while waiting for the first token.
// Hide it once streaming begins.
function TypingIndicator({ visible }: { visible: boolean }) {
  if (!visible) return null;
  return (
    <div className="typing-indicator" aria-label="AI is typing" role="status">
      <span /><span /><span />
    </div>
  );
}
// In the parent:
// - isWaiting = true after submit, false on first token
// - isStreaming = true while tokens arrive, false when done
<TypingIndicator visible={isWaiting && !isStreaming} />
Pattern 3: Error states
LLM API calls fail more often than calls to a typical REST endpoint. Rate limits, network blips, context-window overflows, content policy blocks, and provider outages are all normal operating conditions. An interface that leaves a blank bubble (or worse, a half-streamed message that just stops) trains users to distrust the entire product.
Good error state design has three parts: classification (what kind of error), message (human-readable, specific), and recovery (a retry button or suggested action). The table below lists the four most common failures and the right response for each.
| Error type | User-facing message | Recovery action |
|---|---|---|
| Rate limit (429) | "Too many requests — please wait a moment" | Auto-retry with exponential backoff, or show a countdown |
| Network / timeout | "Couldn't reach the server — check your connection" | Retry button; keep the message in the input so they don't retype |
| Content policy block | "I can't help with that request" | Suggest rephrasing; link to usage policy |
| Context overflow | "This conversation is too long to continue" | "Start a new chat" button; optionally summarise and continue |
For streaming errors (where the stream starts successfully but fails mid-reply), mark the message with a visible truncation notice — "Response interrupted — retry?" — rather than silently leaving an unfinished reply. A half-finished code block with no explanation is worse than no reply at all.
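The classification, message, and recovery steps can be sketched as a pair of lookup tables. This is a minimal illustration under stated assumptions: the `code` field is a hypothetical machine-readable string your own backend would attach to error responses (providers report policy blocks and context overflows in different ways), and the backoff helper omits the random jitter a production client would usually add.

```typescript
type ChatErrorKind =
  | "rate_limit"
  | "network"
  | "content_policy"
  | "context_overflow"
  | "unknown";

// `status` is the HTTP status if a response arrived at all.
// `code` is a hypothetical field set by your own backend.
interface ChatFailure {
  status?: number;
  code?: string;
}

function classifyChatError(failure: ChatFailure): ChatErrorKind {
  if (failure.status === 429) return "rate_limit";
  if (failure.code === "content_policy") return "content_policy";
  if (failure.code === "context_overflow") return "context_overflow";
  if (failure.status === undefined) return "network"; // no response received
  return "unknown";
}

const ERROR_MESSAGES: Record<ChatErrorKind, string> = {
  rate_limit: "Too many requests. Please wait a moment.",
  network: "Couldn't reach the server. Check your connection.",
  content_policy: "I can't help with that request.",
  context_overflow: "This conversation is too long to continue.",
  unknown: "Something went wrong. Please try again.",
};

// Which recovery control the UI should offer for each kind of error.
const RECOVERY_ACTIONS: Record<ChatErrorKind, "auto_retry" | "retry" | "rephrase" | "new_chat"> = {
  rate_limit: "auto_retry",
  network: "retry",
  content_policy: "rephrase",
  context_overflow: "new_chat",
  unknown: "retry",
};

// Exponential backoff without jitter: 1s, 2s, 4s, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

Keeping the user-facing strings in a table rather than scattered through catch blocks makes it easier to review the wording in one place.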
Pattern 4: Citations
Citations are the UI face of retrieval-augmented generation. The model surfaces information from retrieved documents; citations show which document, which claim. Without them the user has no way to verify whether a statement is grounded or hallucinated.
Three citation layouts are in common use, each with different tradeoffs:
- Inline superscript footnotes: [1] inline, sources listed below. Highest readability, good for long answers. Requires the model to output footnote markers (easy to prompt for) and a rendering layer that parses them.
- Inline linked phrases: the claim itself is a hyperlink. Most natural reading experience but hard to implement correctly: the model must output the source URL inside the text, which increases hallucination risk for the URL.
- Source panel: a collapsible "Sources" section below the reply, listing titles and URLs. Easiest to implement (just render sources[] from the retrieval step) and lowest clutter for simple answers, but creates visual disconnection between claim and source.
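For the footnote layout, the rendering layer has to map each marker the model wrote back to a real retrieved document. A minimal sketch, assuming the model emits 1-based markers like `[2]` and that `sources` is the ordered list returned by your retrieval step (the `RetrievedSource` field names are hypothetical):

```typescript
// Shape of one retrieved document; field names are assumptions.
interface RetrievedSource {
  title: string;
  url: string;
}

interface ResolvedCitation extends RetrievedSource {
  marker: number; // the 1-based number the model wrote, e.g. 2 for "[2]"
}

// Finds [n] markers in the model output and maps each to sources[n - 1].
// Markers with no matching source are dropped rather than rendered as links,
// and repeated markers are listed once, in order of first appearance.
function resolveCitations(
  text: string,
  sources: RetrievedSource[],
): ResolvedCitation[] {
  const pattern = /\[(\d+)\]/g;
  const resolved: ResolvedCitation[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const marker = Number(match[1]);
    const source = sources[marker - 1];
    if (source === undefined) continue; // model cited a number we never supplied
    if (resolved.some((c) => c.marker === marker)) continue; // already listed
    resolved.push({ marker, title: source.title, url: source.url });
  }
  return resolved;
}
```

Because the URLs only ever come from `sources`, a marker the model invents simply fails to resolve instead of producing a plausible-looking dead link.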
Pattern 5: Copy button
Selecting text inside a chat bubble is awkward: the bubble has padding, the pointer accidentally selects the avatar, and on mobile the browser's text-selection handles are clumsy. A copy button solves all of this in three lines of code. Show it on hover (desktop) or as a persistent small icon (mobile). Flash a checkmark for ~1.5 seconds on success so the user knows the clipboard was updated.
import { useState } from "react";
function CopyButton({ text }: { text: string }) {
  const [copied, setCopied] = useState(false);
  const handleCopy = async () => {
    try {
      await navigator.clipboard.writeText(text);
    } catch {
      return; // clipboard unavailable or permission denied: don't show the checkmark
    }
    setCopied(true);
    setTimeout(() => setCopied(false), 1500);
  };
  return (
    <button
      onClick={handleCopy}
      aria-label={copied ? "Copied" : "Copy response"}
      title={copied ? "Copied!" : "Copy"}
    >
      {copied ? "✓" : "Copy"}
    </button>
  );
}
For code blocks specifically, copy the raw code without the markdown fences; users almost never want the triple-backtick wrappers. Maintain a separate rawCode string stripped of fences before passing it to the copy handler.
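One possible way to derive that rawCode string is sketched below. The helper name `stripCodeFences` is hypothetical, and the sketch assumes the input is a single fenced block; if your markdown renderer already exposes the code node's text, using that directly is likely simpler.

```typescript
const FENCE = "`".repeat(3); // three backticks

// Removes one opening fence line (with optional language tag) and one
// closing fence line. Input that is not a fenced block is returned unchanged.
function stripCodeFences(block: string): string {
  const lines = block.trim().split("\n");
  const first = lines[0] ?? "";
  const last = lines[lines.length - 1] ?? "";
  const fenced =
    lines.length >= 2 && first.startsWith(FENCE) && last.trim() === FENCE;
  return fenced ? lines.slice(1, -1).join("\n") : block;
}
```

The result can then be passed as the `text` prop of the copy button in place of the rendered markdown.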
Pattern 6: Regenerate
A Regenerate button resubmits the last user message and replaces the previous reply. It solves one of the most common frustrations in chat interfaces: a first answer that's close but wrong, incomplete, or in the wrong tone — where the user doesn't want to retype the whole question but also doesn't want to just accept the bad reply.
Implementation notes: Store the full conversation history as a mutable array. Regenerate pops the last assistant message, optionally tweaks the temperature or system prompt, and re-calls the model. ChatGPT ships tree-based branching (every regenerate creates a new branch you can navigate), and Claude supports edit-and-branch on user turns. For a first version, simple replace-and-resubmit is enough and covers most of the use case.
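A minimal sketch of the replace-and-resubmit step follows. It is a non-mutating variant of the pop described above: it returns a new array instead of changing the stored history, which is one way to keep the old reply around if you later add branching. The type and function names are hypothetical.

```typescript
type ChatRole = "user" | "assistant";

interface ChatTurn {
  role: ChatRole;
  content: string;
}

// Returns the history to resend: everything up to and including the last
// user turn. If the last turn is not an assistant reply, nothing is dropped.
function historyForRegenerate(turns: ChatTurn[]): ChatTurn[] {
  const last = turns[turns.length - 1];
  if (last === undefined || last.role !== "assistant") return turns.slice();
  return turns.slice(0, -1);
}
```

The caller would send the returned array to the model and, once the new reply finishes, store it in place of the old one.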
Pattern 7: Conversation history
Conversation history has two distinct problems that are easy to conflate. In-session context is the list of messages you send to the model on every API call so it "remembers" the conversation. Cross-session persistence is the user-facing history list — the sidebar of past conversations that lets someone return to a thread days later. They are solved differently.
In-session context: the memory facade
LLM APIs are stateless: each call knows nothing about the previous one unless you include that history in the request. The standard pattern is the memory facade: maintain a messages array in your frontend or backend state and append every user message + assistant reply to it. Send the full array on every call.
// Each item is { role: 'user' | 'assistant', content: string }
let messages: { role: "user" | "assistant"; content: string }[] = [];
// callLLM is your own wrapper around the model API; it resolves to the reply text.
async function sendMessage(userText: string) {
  // 1. Append the new user turn
  messages.push({ role: "user", content: userText });
  // 2. Send full history to the model
  const reply = await callLLM(messages);
  // 3. Append the assistant reply for the next call
  messages.push({ role: "assistant", content: reply });
  return reply;
}
This works until the conversation grows longer than the model's context window. Strategies when you hit that limit: rolling window (keep only the last N messages), summarisation (ask the model to summarise old turns into a compact paragraph and replace them), or selective retrieval (embed all messages and retrieve the most relevant ones for each new turn, similar to RAG). These strategies are covered in depth in the context window and RAG articles linked above.
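Of the three strategies, the rolling window is the simplest to sketch. The version below counts messages rather than tokens, which is an approximation: a real limit is measured in tokens, so treat `maxMessages` as a stand-in. Dropping leading assistant turns so the window starts on a user message is a design choice of this sketch, not a requirement of any API.

```typescript
interface HistoryMessage {
  role: "user" | "assistant";
  content: string;
}

// Keeps at most the last `maxMessages` messages, then trims any leading
// assistant turns so the model never sees a reply without its question.
function rollingWindow(
  history: HistoryMessage[],
  maxMessages: number,
): HistoryMessage[] {
  if (maxMessages <= 0) return [];
  const recent = history.slice(-maxMessages);
  const firstUser = recent.findIndex((m) => m.role === "user");
  return firstUser === -1 ? [] : recent.slice(firstUser);
}
```

You would apply this to a copy of the array just before each call, leaving the full history in storage for display.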
Cross-session persistence: the history sidebar
This is a product feature, not a model feature. Store each conversation as a record in your database with a unique ID, a timestamp, a title (either user-set or auto-generated from the first message), and the serialised message array. On load, fetch the list of past conversations and render them in a sidebar. Clicking one loads the full message array and continues the thread.
UX details that matter here: Auto-title the conversation from the first user message (truncated to ~50 characters) so the sidebar is scannable without requiring the user to name things. Group by date (Today, Yesterday, Last 7 days) — users scan by recency, not alphabetically. Provide a search or filter once the list grows beyond ~20 conversations; a flat unsorted list of 100 chats is unusable.
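The auto-title and date-grouping details can be sketched as two small helpers. The names are hypothetical, and the "Older" bucket is an addition for conversations past the seven-day mark; grouping uses the local calendar day of the `now` value you pass in.

```typescript
// Collapses whitespace and truncates to maxLength characters, ending with
// an ellipsis when text was cut.
function autoTitle(firstMessage: string, maxLength = 50): string {
  const singleLine = firstMessage.trim().replace(/\s+/g, " ");
  if (singleLine.length <= maxLength) return singleLine;
  return singleLine.slice(0, maxLength - 1).trim() + "…";
}

type DateBucket = "Today" | "Yesterday" | "Last 7 days" | "Older";

// Buckets a conversation by how many local calendar days ago it was updated.
function dateBucket(updatedAt: Date, now: Date): DateBucket {
  const startOfDay = (d: Date): number =>
    new Date(d.getFullYear(), d.getMonth(), d.getDate()).getTime();
  const dayMs = 24 * 60 * 60 * 1000;
  // Math.round absorbs the 23- or 25-hour days around daylight saving changes.
  const daysAgo = Math.round((startOfDay(now) - startOfDay(updatedAt)) / dayMs);
  if (daysAgo <= 0) return "Today";
  if (daysAgo === 1) return "Yesterday";
  if (daysAgo <= 7) return "Last 7 days";
  return "Older";
}
```

Passing `now` in explicitly, rather than calling `new Date()` inside the helper, keeps the grouping easy to test.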
Going deeper
Once the seven core patterns are in place, several second-order problems come into view. These are the refinements that separate a working chat UI from a polished one.
Buffered markdown rendering
Streaming reveals a rendering problem: markdown delimiters arrive in pieces. A bold marker ** arrives, but its closing ** is three chunks later. A code fence starts, but the language identifier is in the next chunk. Naively rendering each chunk individually produces flickering half-marked text. The standard fix is a buffered renderer: accumulate the full string so far and re-parse it on every chunk. This is slightly more CPU-intensive but produces a smooth, correct output. Libraries like react-markdown handle this cleanly if you pass the accumulated string as the input.
Accessibility: aria-live regions
Screen readers don't notice dynamically appended text by default. Wrap the streaming reply in an aria-live="polite" region so assistive technology reads it out as it arrives. Use aria-live="assertive" for error messages that need immediate attention. These are small additions — one attribute — that make the interface usable for blind or low-vision users who depend on screen readers.
Confidence and uncertainty signals
Some queries have clear answers; others are genuinely uncertain. Mature chat UIs distinguish between them. Rather than false-precision probability scores ("73% confident"), use plain language hedges: "I'm not certain, but…", "I couldn't find a verified source for this", "You may want to double-check this with a specialist". These are easier to implement (prompt the model to hedge when uncertain) and more honest than a number the model can't actually calibrate. The problem of LLM hallucination is structural — calibration helps, but user-visible hedges are the safest UX fallback.
Feedback loops: thumbs up/down
A thumbs-up / thumbs-down button on each reply costs almost nothing to implement and provides two things: a signal to your users that quality is taken seriously, and a dataset you can use for fine-tuning or evaluation. Store each rating with the full conversation context (messages, model, temperature, system prompt) so you can actually debug which configurations produce poor ratings. Many teams treat this as vanity data and never mine it — teams that do mine it find it becomes their best LLM evaluation signal.
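As an illustration of what "full conversation context" might look like in storage, here is one possible record shape. All field names are assumptions, and the model name in the usage comment is a placeholder.

```typescript
// Everything needed to reproduce the reply that was rated.
interface ReplyContext {
  messages: { role: "user" | "assistant"; content: string }[];
  model: string;
  temperature: number;
  systemPrompt: string;
}

interface FeedbackRecord extends ReplyContext {
  rating: "up" | "down";
  ratedAt: string; // ISO 8601 timestamp
}

// Builds a record with a snapshot copy of the messages, so later edits to
// the live conversation do not change what was rated.
function buildFeedbackRecord(
  rating: "up" | "down",
  context: ReplyContext,
  now: Date = new Date(),
): FeedbackRecord {
  return {
    rating,
    messages: context.messages.map((m) => ({ role: m.role, content: m.content })),
    model: context.model,
    temperature: context.temperature,
    systemPrompt: context.systemPrompt,
    ratedAt: now.toISOString(),
  };
}

// Usage (placeholder values):
// buildFeedbackRecord("down", { messages, model: "example-model",
//   temperature: 0.7, systemPrompt: "Be brief." });
```

Storing the snapshot alongside the rating is what makes it possible to group poor ratings by model or prompt later.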
Progressive disclosure on long answers
Long model responses are common. Dumping a 2000-word reply into a chat bubble is not great UX. Options: collapse after N lines with a "Show more" toggle (works for most cases), structured sections with anchors (better for document-length outputs), or skeleton outline first (show the response structure, then stream the content into each section). Which pattern fits depends on your use case, but any of them beats a single scrollable wall of text.
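The collapse-after-N-lines option can be sketched as a pure function that the bubble component calls before rendering. The names are hypothetical, and the sketch splits on raw newlines, so it can cut through the middle of a code block or table; a production version would probably collapse on rendered blocks instead.

```typescript
interface CollapsedReply {
  visible: string; // text to render while collapsed
  hiddenLineCount: number; // 0 means no "Show more" toggle is needed
}

// Keeps the first maxLines lines and reports how many were hidden.
function collapseReply(text: string, maxLines: number): CollapsedReply {
  const lines = text.split("\n");
  if (lines.length <= maxLines) return { visible: text, hiddenLineCount: 0 };
  const shown = Math.max(0, maxLines);
  return {
    visible: lines.slice(0, shown).join("\n"),
    hiddenLineCount: lines.length - shown,
  };
}
```

The toggle label can then use the count, for example "Show 12 more lines".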
Further reading: the LLM streaming explained article covers the server-side plumbing; what is a context window explains the limits on how much history you can send; and what makes good AI UX covers the design principles these patterns implement.
FAQ
What is the most important chatbot UX pattern to implement first?
Streaming. It has the highest impact-to-effort ratio of any pattern: reading the response as a stream instead of awaiting the full body turns a blank waiting screen into a live, responsive interface. Every other pattern (typing indicators, copy buttons, regenerate) matters more once streaming is in place. Without streaming, the wait itself destroys trust before users see the reply.
How do I show a typing indicator before streaming starts?
Track two state variables: isWaiting (true from submit until the first token) and isStreaming (true from the first token until the stream closes). Show the typing indicator when isWaiting is true and isStreaming is false. Hide it the moment the first chunk arrives. The gap is time-to-first-token — often 1–4 seconds — and the indicator fills that gap so the user knows the system is working.
Should citation URLs come from the model or from my retrieval layer?
Always from your retrieval layer. When you ask the model to generate URLs from memory, it hallucinates them confidently — the URL looks plausible but leads nowhere. Instead, have your retrieval step return verified source URLs alongside the documents, inject them into the system prompt as numbered references, and ask the model only to cite which reference number supports each claim. Your display layer maps the number back to the real URL.
What should a chatbot error state include?
Three things: a human-readable description of what went wrong (not a raw error code), a specific recovery action (a Retry button, a "Start new chat" button, or a suggested rephrasing), and the user's original message pre-filled in the input so they don't need to retype it. Different errors need different messages — a rate limit, a network timeout, a content policy block, and a context overflow each have a distinct cause and a different fix.
How does conversation history work if LLM APIs are stateless?
Every LLM API call is independent and knows nothing about previous calls. The client (your app) maintains the conversation as a messages array — each item has a role (user or assistant) and the text content. On every new turn, you append the new user message and send the full array to the model. The model sees the whole conversation and responds in context. This is called the memory facade pattern.
What is regenerate in a chatbot, and how is it different from retry?
Regenerate resubmits the last user message to get a different reply — the model runs again, usually with different random sampling, and produces an alternative answer. Retry is for errors: it resends the same request when the API call itself failed. Regenerate is a quality tool ("I want a better answer"); retry is a reliability tool ("the request never completed"). Both should be one-click, but they appear at different points: retry after an error state, regenerate after a successful but unsatisfying reply.