In plain English
Try multiplying 17 × 24 in your head, right now, with no pause allowed. Most people can't. Now do it on paper: 17 × 20 is 340, 17 × 4 is 68, 340 + 68 is 408. Same brain, totally different result — because the paper holds your intermediate steps so you don't have to juggle them all at once.
Chain-of-thought prompting (CoT) gives a language model that same piece of paper. Instead of asking for the answer directly, you ask the model to write out its reasoning — step by step, in plain text — before it states the answer. That visible sequence of steps is the "chain of thought," and on math, logic, and other multi-step problems it can turn a model that confidently gets the wrong answer into one that reliably gets the right one.
The technique became famous through one absurdly simple trick: appending "Let's think step by step" to a prompt. A 2022 paper by Takeshi Kojima and colleagues showed that this single sentence — no examples, no fine-tuning, nothing else — dramatically improved large models' accuracy on arithmetic and logic benchmarks. The term itself comes from an earlier 2022 paper by Jason Wei and colleagues at Google, who got the same effect by showing the model worked examples with the reasoning written out.
Why it matters
An LLM generates text one token at a time, and each token gets roughly the same fixed amount of computation. When you demand a direct answer to a multi-step problem, you're asking the model to compress the entire solution into the handful of tokens that make up the answer — the equivalent of forcing you to blurt out 17 × 24 with no scratch paper. For problems that genuinely require several dependent steps, that often just doesn't fit.
Before CoT, the working assumption was that better reasoning required bigger models or task-specific fine-tuning. The Wei et al. result flipped that: the reasoning ability was already in the large models, latent, and a prompt change unlocked it. On the GSM8K math benchmark, few-shot CoT took the largest models from embarrassing to genuinely useful — while doing almost nothing for small models. Reasoning-by-prompting turned out to be an emergent ability of scale.
Who should care today:
- Anyone whose prompts involve math, dates, units, or counting. These are exactly the tasks where direct answers fail silently.
- Builders of classification or extraction pipelines with tricky rubrics. Making the model state why before what catches edge cases.
- Anyone debugging a model's wrong answers. A visible chain shows you where the reasoning went off the rails, which an answer-only prompt never can.
- Agent builders. Plan-then-act is chain-of-thought applied to actions instead of arithmetic.
CoT also matters historically: it's the direct ancestor of today's reasoning models, which take the same idea — spend tokens reasoning before answering — and train it in with reinforcement learning instead of asking for it politely.
How it works
The mechanism falls out of how generation works. An LLM is autoregressive: every token it produces is appended to the input and becomes context for the next token. The model's output is also its working memory.
So when the model writes "2 cans of 3 balls is 6 balls," that sentence is now sitting in the context, available for attention, for the rest of the generation. The intermediate result has been computed, written down, and can be read back instead of re-derived. A chain of thought is a scratchpad the model builds for itself, one line at a time.
This also explains why direct answers hit a wall. A transformer has a fixed number of layers, so there's a hard ceiling on how much sequential computation can happen inside a single token. A problem that needs ten dependent steps can't be solved in one token's worth of compute — but it can be solved across ten tokens' worth. More reasoning tokens before the answer literally means more serial computation spent on the problem. Tokens are compute.
- One shot at the answer
- All steps crammed into few tokens
- Errors are invisible
- Fast and cheap
- Problem decomposed into steps
- More tokens = more compute
- Errors visible mid-chain
- Slower, more output tokens
Zero-shot vs few-shot CoT
There are two classic ways to trigger a chain of thought, and they map directly onto the broader zero-shot vs few-shot split.
Zero-shot CoT: just ask
Append a trigger phrase — "Let's think step by step," "Reason through this before answering," "Show your work" — and let the model improvise the reasoning format. Zero setup, works surprisingly well:
Q: The cafeteria had 23 apples. They used 20 to make lunch
and bought 6 more. How many apples do they have?
Let's think step by step.Few-shot CoT: show worked examples
The original Wei et al. formulation: include one or more example question–answer pairs where the answer demonstrates the reasoning. The model imitates the style — including your decomposition strategy, your format, and your final-answer convention:
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis
balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls.
5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. They used 20 to make lunch
and bought 6 more. How many apples do they have?
A:| Zero-shot CoT | Few-shot CoT | |
|---|---|---|
| Setup cost | One trigger phrase | Hand-write worked examples |
| Control over reasoning style | Low — model improvises | High — model imitates your examples |
| Prompt length | Tiny | Grows with each example |
| Best for | Quick wins, exploration | Production prompts with a fixed format |
In production you usually want one more ingredient: a clean way to separate the reasoning from the answer, so your code can parse one and log the other. Tags work well:
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY ("sk-...")
PROMPT = """A store sells pens in packs of 12 for $3.60, or singly
for $0.40 each. What is the cheapest way to buy exactly 30 pens,
and what does it cost?
Reason through the problem step by step inside <thinking> tags.
Then give only the final answer inside <answer> tags."""
response = client.messages.create(
model="claude-opus-4-8",
max_tokens=1024,
messages=[{"role": "user", "content": PROMPT}],
)
text = response.content[0].text
# Log the full chain for debugging; show the user only the answer
answer = text.split("<answer>")[1].split("</answer>")[0].strip()
print(answer) # 2 packs ($7.20) + 6 singles ($2.40) = $9.60When to use it — and when to skip it
CoT is not a universal seasoning to sprinkle on every prompt. It shines on a specific class of problems and is wasted — or actively harmful — on others.
| Reach for CoT | Skip CoT |
|---|---|
| Arithmetic and word problems | Simple factual lookups ("capital of France") |
| Date, unit, and calendar logic | Tasks the model already aces directly |
| Multi-criteria classification rubrics | Creative writing — steps can flatten the prose |
| Symbolic puzzles, logic chains | Latency-critical paths with no accuracy gain |
| Debugging code or tracing execution | Reasoning models that already think on their own |
Three practical rules. First, measure: run your eval set with and without the CoT instruction — if accuracy doesn't move, you're paying extra tokens and latency for nothing. Second, budget for the cost: a chain can be 5–20× the tokens of a bare answer, and output tokens are the expensive kind. Third, check what model you're on: modern reasoning models generate their own internal chains, so bolting "think step by step" onto them is usually redundant — control their thinking through the API's reasoning settings instead, as covered in how thinking tokens work.
Going deeper
CoT spawned a family of techniques. Once reasoning lives in sampled text, you can manipulate it like any other output. Self-consistency samples many independent chains and takes a majority vote on the final answer — different chains make different mistakes, and the errors wash out. Tree-of-thought goes further: instead of one linear chain, the model proposes multiple candidate steps, evaluates them, and backtracks from dead ends — search over reasoning rather than a single rollout.
Then the labs trained it in. Reasoning models are, mechanically, CoT internalized: reinforcement learning rewards the model for producing long private reasoning traces that lead to verifiably correct answers. The chain moves out of your prompt and into the model's "thinking" phase, you stop writing trigger phrases, and the cost model shifts — you pay for thinking tokens you may never read. The prompt-engineering skill becomes deciding how much thinking a task deserves, not eliciting it.
Faithfulness is the open wound. Turpin et al.'s 2023 paper Language Models Don't Always Say What They Think showed you can bias a model toward a wrong answer (for instance, by making the correct option always "(A)" in the few-shot examples) and the model will produce a plausible chain of thought justifying the wrong answer — never mentioning the bias that actually drove it. The chain optimizes for looking like reasoning; whether it is the reasoning varies by task and model. This is an active interpretability research area, and the practical takeaway stands: chains are a debugging aid and an accuracy lever, not ground truth.
Production patterns worth knowing. Long chains interact badly with tight max_tokens limits — if the model runs out of budget mid-chain, you get reasoning and no answer, so always leave headroom and parse defensively. For pipelines that need inspectable intermediate state, consider splitting reason-then-answer into two separate calls — that's prompt chaining, and it trades latency for control. And when chains get long, stream the output: users tolerate visible progress far better than a ten-second silence.
Where the research is heading. Three threads to watch: latent reasoning (doing the step-by-step computation in hidden activations instead of text, trading interpretability for efficiency), chain-length control (models that spend many tokens on hard problems and few on easy ones, rather than padding everything), and faithfulness training (making the stated chain causally match the computation). The 2022 trick of asking nicely turned out to be the opening move in a much longer game about how — and where — models should think.
FAQ
Does adding "think step by step" still work on modern models?
It still helps on non-reasoning models for genuinely multi-step problems, but the gains are smaller than in 2022 — instruction-tuned models often reason out loud by default. On reasoning models that generate internal thinking tokens, the phrase is mostly redundant: the model already does this, and you control the depth through API settings rather than the prompt.
What's the difference between chain-of-thought prompting and a reasoning model?
Chain-of-thought is something you ask for in the prompt: the model writes its reasoning as visible output text. A reasoning model was trained — typically with reinforcement learning — to produce a reasoning phase on its own before every answer, usually as separate "thinking" tokens. Same core idea (spend tokens reasoning before answering); one is elicited with words, the other is baked into the weights.
Does chain-of-thought prompting make API calls more expensive?
Yes. The chain is output tokens — the expensive kind — and a thorough chain can be 5–20 times longer than a bare answer, with matching latency. That's fine when it converts wrong answers into right ones. Scope it: apply CoT to the tasks where your evals show a gain, and skip it on lookups and easy classifications.
How do I hide the reasoning and show users only the final answer?
Ask the model to put its reasoning inside one tag (like <thinking>) and the answer inside another (like <answer>), then parse out the answer in your code and log the rest. Alternatively, split it into two calls — one that reasons, one that formats the final answer — which is the prompt-chaining pattern.
Does chain-of-thought work on small models?
Poorly, in the original sense. The 2022 results showed CoT is emergent with scale: small models produce fluent-looking chains full of arithmetic and logic errors, sometimes scoring worse than direct answers. Modern small models trained or distilled on reasoning traces do considerably better, but the rule of thumb holds — the weaker the model, the less a prompt trick can extract from it.