In plain English
A "shot" is just an example. That's the whole secret. Zero-shot prompting means you describe the task and give the model no examples. One-shot means you include exactly one worked example. Few-shot means you include a handful — usually two to eight — before asking your real question. The jargon sounds academic, but it's counting, nothing more.
Think of onboarding a new hire to file expense reports. Zero-shot is telling them: "Fill out the expense form, one row per receipt, categorize each purchase." One-shot is handing them a single completed report and saying "like this." Few-shot is handing them three completed reports — one normal, one with a foreign-currency receipt, one with a rejected line item — so they've seen the edge cases before they hit them. Same task, three levels of demonstration.
Here's what that looks like with a language model doing sentiment labeling. Zero-shot: "Classify this review as positive, negative, or neutral: 'The battery died in two hours.'" Few-shot: "'Best purchase I ever made' → positive. 'Arrived broken, support ignored me' → negative. 'Fine I guess, does the job' → neutral. Now classify: 'The battery died in two hours.'" The second prompt doesn't just ask; it shows. The model sees the pattern and continues it.
Why it matters
Before 2020, teaching a model a new task meant collecting thousands of labeled examples and fine-tuning a custom model — days of work, real money, one model per task. The GPT-3 paper showed that once models get big enough, they can pick up a brand-new task from a few examples placed directly in the prompt, with zero retraining. That single finding is why prompt engineering exists as a discipline: the prompt became the programming interface.
For anyone building on LLM APIs today, this distinction is the cheapest reliability lever you have. Adding three good examples to a prompt takes five minutes and routinely fixes problems that no amount of instruction-rewriting can: output that drifts between formats, labels applied inconsistently, a tone that's almost-but-not-quite your house style.
Examples are especially powerful for things that are easy to show but painful to describe. Try writing instructions for "summaries in our company voice" and you'll produce three paragraphs of adjectives the model half-follows. Paste in two real summaries your team loved, and the model nails it. Format mimicry, fuzzy category boundaries, judgment calls — these all transmit better by demonstration than by description.
- Zero-shot replaced: nothing — it's the default. You're doing it every time you type a plain request into a chatbot.
- Few-shot replaced: small-scale fine-tuning. Tasks that once required a labeled dataset and a training run now often need three examples and a paragraph of instructions.
- Who should care: anyone shipping classification, extraction, formatting, or style-sensitive generation through an LLM API — which is most LLM products.
How it works
The mechanism behind few-shot prompting is called in-context learning, and the name is slightly misleading — the model doesn't learn anything permanent. Its weights don't change. An LLM is a next-token predictor: it looks at everything in the context window and predicts what text comes next. When your prompt contains three input→output pairs followed by a fourth input, the statistically obvious continuation is a fourth output in the same shape. The model is pattern-completing, the way you'd finish the sequence "2, 4, 6, …" without anyone teaching you a rule.
This is why the effect evaporates the moment the request ends. Send a second API call without the examples and the model behaves as if it never saw them. In-context learning is per-request rental, not ownership — which is also its superpower: you can change the task instantly by changing the examples, no training pipeline required.
- Zero-shot: instructions only, no examples. Fastest to write and cheapest in tokens, but relies on the model already knowing the task shape.
- One-shot: instructions plus one example. Anchors the output format. Risk: the model overfits to that single example.
- Few-shot: instructions plus 2–8 examples. Shows edge cases and label boundaries and is the most reliable for format and style, but costs tokens on every call.
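The three levels differ only in how many demonstrations sit in front of the real input. A minimal sketch, assuming a hypothetical helper `build_prompt` and a plain `'input' → label` layout (both are illustrative choices, not a library API):

```python
# Hypothetical helper: the function name, the prompt layout, and the
# arrow separator are illustrative choices, not part of any library.
EXAMPLES = [
    ("Best purchase I ever made", "positive"),
    ("Arrived broken, support ignored me", "negative"),
    ("Fine I guess, does the job", "neutral"),
]


def build_prompt(instruction: str, query: str, shots: int = 0) -> str:
    """Return a prompt with `shots` worked examples before the real input."""
    lines = [instruction]
    for text, label in EXAMPLES[:shots]:
        lines.append(f"'{text}' → {label}")
    lines.append(f"Now classify: '{query}'")
    return "\n".join(lines)


instruction = "Classify each review as positive, negative, or neutral."
query = "The battery died in two hours."

zero_shot = build_prompt(instruction, query, shots=0)  # instructions only
one_shot = build_prompt(instruction, query, shots=1)   # one demonstration
few_shot = build_prompt(instruction, query, shots=3)   # one per label

print(few_shot)
```

Because the examples live in the string itself, they have to be rebuilt and sent with every request; nothing carries over between calls.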
What do the examples actually transmit? Four things, roughly in order of strength: the output format (JSON vs prose, label vocabulary, length), the label space (which categories exist and where their borders are), the tone and style, and the level of detail expected. Notice what's not on the list: facts. Examples shape how the model answers far more than what it knows.
A few-shot prompt in code
There are two standard ways to embed examples. The first is fake conversation turns: you write each example as a user message paired with the ideal assistant reply, so the model sees a transcript of itself already doing the task correctly. This works with any chat-style API:
```python
import anthropic

client = anthropic.Anthropic(api_key="sk-ant-...")

response = client.messages.create(
    model="claude-opus-4-8",  # any modern chat model works
    max_tokens=10,
    system=(
        "You classify support tickets as exactly one of: "
        "billing, bug, feature_request. Reply with the label only."
    ),
    messages=[
        # --- the "shots": each pair is a worked example ---
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
        {"role": "user", "content": "The export button does nothing when I click it."},
        {"role": "assistant", "content": "bug"},
        {"role": "user", "content": "Would love a dark mode for night shifts."},
        {"role": "assistant", "content": "feature_request"},
        # --- the real input ---
        {"role": "user", "content": "My invoice shows the wrong company name."},
    ],
)

print(response.content[0].text)  # -> billing
```
The second way is inline examples inside a single prompt, each wrapped in a delimiter like <example>...</example> tags so the model can't confuse demonstrations with instructions. Both styles work; message pairs tend to anchor format hardest, while inline tags keep everything in one reusable string — handy when your examples live inside a prompt template with variables.
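The inline style can be sketched with the standard library's `string.Template`. The tag names (`<examples>`, `<ticket>`, `<label>`), the template variables, and the helper functions below are assumptions made for illustration; any consistent delimiter scheme would serve the same purpose.

```python
from string import Template

# Hypothetical template: tag names and variable names are illustrative.
PROMPT = Template(
    "Classify the support ticket as exactly one of: "
    "billing, bug, feature_request. Reply with the label only.\n\n"
    "<examples>\n$examples\n</examples>\n\n"
    "<ticket>$ticket</ticket>"
)

EXAMPLES = [
    ("I was charged twice this month.", "billing"),
    ("The export button does nothing when I click it.", "bug"),
    ("Would love a dark mode for night shifts.", "feature_request"),
]


def render_examples(pairs):
    """Wrap each (input, label) pair in its own <example> block."""
    return "\n".join(
        f"<example>\n<ticket>{text}</ticket>\n<label>{label}</label>\n</example>"
        for text, label in pairs
    )


def render_prompt(ticket: str) -> str:
    """Fill the template with the example block and the real input."""
    return PROMPT.substitute(examples=render_examples(EXAMPLES), ticket=ticket)


prompt = render_prompt("My invoice shows the wrong company name.")
print(prompt)
```

The rendered string can be sent as a single user message. Keeping the example block in one variable also makes it easy to swap or version the examples without touching the instructions.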
When to use which
The professional workflow is zero-shot first. Modern instruction-tuned models are good enough that clear instructions alone solve a surprising share of tasks — and a zero-shot prompt is shorter, cheaper, and easier to maintain. Add examples only when you observe a specific failure: format drift, inconsistent labels, wrong tone. Each example you add should fix a failure you actually saw, not a failure you imagined.
| Situation | Start with | Why |
|---|---|---|
| General knowledge Q&A, summarization, translation | Zero-shot | The model has seen millions of these; instructions suffice |
| Strict output format (JSON shape, CSV, label-only replies) | Few-shot (2–3) | Demonstration anchors format better than description |
| Fuzzy category boundaries (spam vs promo, urgent vs normal) | Few-shot (3–5) | Borderline examples define where the line sits |
| House style or brand voice | Few-shot (2–4) | Style transmits by imitation, not adjectives |
| Multi-step reasoning or math | Zero-shot + reasoning technique | Examples of answers don't teach the process |
| Task too complex for any prompt | Fine-tuning | Once you need more than a few dozen examples, training often beats prompting |
How many shots? Three to five is the working default: enough to establish a pattern and cover an edge case or two, before token cost and diminishing returns kick in. Picking which examples matters more than piling on more of them; we cover selection strategy in "How many few-shot examples do you need?"
Common pitfalls
- Unbalanced examples skew output. Three positive-sentiment examples and zero negative ones quietly teach the model that everything is positive. Cover each label, roughly evenly.
- The last example pulls hardest. Models often show recency bias: outputs drift toward the final example's style and label. If one example must dominate, put it last; if none should, shuffle the order while testing.
- Examples that are too uniform teach the wrong pattern. If every example input is one sentence long, the model may choke on a paragraph. Vary surface features that shouldn't matter so the model learns they don't.
- Stale examples after the task evolves. You changed the label set in the instructions but forgot the examples still use the old one. Now the prompt contradicts itself. Version your prompts and update examples and instructions together.
- Paying for examples on every call. Few-shot tokens are sent with every single request. For high-volume endpoints that's real money and latency — one reason to keep the example block tight.
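Several of these pitfalls can be caught mechanically before a prompt ships. The sketch below is one possible lint pass, not an established tool: `check_examples` is a hypothetical helper, and the "same word count" test is a deliberately crude stand-in for "too uniform".

```python
from collections import Counter


def check_examples(examples, labels):
    """Return a list of human-readable problems found in a few-shot block.

    examples: list of (input_text, label) pairs
    labels:   the label set the instructions currently promise
    """
    problems = []
    counts = Counter(label for _, label in examples)

    # Stale examples: a label the instructions no longer allow.
    for label in sorted(set(counts) - set(labels)):
        problems.append(f"stale label in examples: {label}")

    # Unbalanced examples: an allowed label with no demonstration.
    for label in labels:
        if counts[label] == 0:
            problems.append(f"no example for label: {label}")

    # Uniform examples (crude heuristic): every input has the same word count.
    lengths = {len(text.split()) for text, _ in examples}
    if len(examples) > 1 and len(lengths) == 1:
        problems.append("all example inputs have the same word count")

    return problems


examples = [
    ("I was charged twice this month.", "billing"),
    ("Refund still missing after two weeks, please help.", "billing"),
    ("The export button does nothing when I click it.", "bug"),
    ("Add SSO please.", "enhancement"),  # old name for feature_request
]
labels = ["billing", "bug", "feature_request"]

for problem in check_examples(examples, labels):
    print(problem)
```

Running a check like this in the same test suite that versions the prompt keeps examples and instructions from drifting apart.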
Going deeper
In-context learning is an emergent ability: small models barely benefit from prompt examples, while large models extract striking value from them — that scaling effect was the central result of the GPT-3 paper. What's stranger is what the examples contribute. A well-known 2022 study (Min et al., "Rethinking the Role of Demonstrations") found that replacing the gold labels in few-shot examples with random labels barely hurt classification accuracy on the models tested. The demonstrations were doing their work through format, input distribution, and label space — not through the correctness of the input→label mapping. Follow-up research complicated the picture: larger, stronger models do read the mapping, and will even follow deliberately flipped labels against their own prior knowledge. Practical takeaway: get the format and label space right always; sweat label correctness more as models get stronger.
Long context windows opened up many-shot in-context learning — hundreds or even thousands of examples in one prompt. Research from Google DeepMind showed performance often keeps climbing well past the classic 5-shot regime, to the point where many-shot prompting starts competing with fine-tuning on some tasks. The trade-off is brutal token cost per call, which is where prompt caching matters: a large, stable example block placed at the start of the prompt can be cached by the provider and re-served cheaply, but only if you don't churn its contents between calls.
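The cost trade-off is easy to put into rough numbers. In the sketch below every figure is a hypothetical input (token counts, call volume, price per million tokens, cache hit rate, cache discount); none is taken from a real price list, and the model ignores any surcharge a provider may apply when a cache entry is first written.

```python
def monthly_example_cost(example_tokens, calls_per_month, price_per_mtok,
                         cached_fraction=0.0, cache_discount=0.0):
    """Estimate monthly spend on the example block alone.

    cached_fraction: share of calls that hit the provider's cache (0..1)
    cache_discount:  fraction knocked off the input price on a cache hit
    All inputs are assumptions you supply; nothing is looked up.
    """
    per_call = example_tokens * price_per_mtok / 1_000_000
    full_price_calls = calls_per_month * (1 - cached_fraction)
    cached_calls = calls_per_month * cached_fraction
    return (full_price_calls * per_call
            + cached_calls * per_call * (1 - cache_discount))


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
few_shot = monthly_example_cost(400, 1_000_000, price_per_mtok=3.0)
many_shot = monthly_example_cost(50_000, 1_000_000, price_per_mtok=3.0)
many_shot_cached = monthly_example_cost(
    50_000, 1_000_000, price_per_mtok=3.0,
    cached_fraction=0.95, cache_discount=0.9,
)

print(f"few-shot:          ${few_shot:,.0f}")
print(f"many-shot:         ${many_shot:,.0f}")
print(f"many-shot, cached: ${many_shot_cached:,.0f}")
```

Under these assumed inputs the cached many-shot block costs a fraction of the uncached one, which is why keeping the block stable between calls matters so much.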
Production systems increasingly use dynamic few-shot selection instead of a fixed example block: embed your example library, then at request time retrieve the handful of examples most similar to the incoming input and splice them into the prompt. It's retrieval-augmented generation pointed at examples rather than documents, and it consistently beats static examples on heterogeneous traffic — at the cost of a vector lookup per request and a cache-unfriendly prompt.
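The retrieval step can be sketched in a few lines. To stay self-contained, the example below swaps in a toy bag-of-words vector for a real embedding model and a linear scan for a vector index; `embed`, `cosine`, `select_examples`, and the example library are all hypothetical names and data for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(library, query, k=2):
    """Return the k library examples most similar to the query."""
    q = embed(query)
    ranked = sorted(library, key=lambda ex: cosine(embed(ex[0]), q), reverse=True)
    return ranked[:k]


LIBRARY = [
    ("My card was charged twice for the same invoice.", "billing"),
    ("My invoice lists the wrong tax rate.", "billing"),
    ("The export button does nothing when I click it.", "bug"),
    ("The app crashes when I open settings.", "bug"),
    ("Would love a dark mode for night shifts.", "feature_request"),
]

query = "My invoice shows the wrong company name."
shots = select_examples(LIBRARY, query, k=2)

for text, label in shots:
    print(f"{label}: {text}")
```

The selected pairs would then be spliced into the prompt as message pairs or tagged inline examples. In a real system the example vectors would be computed once and stored, so only the query is embedded at request time.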
Finally, a frontier wrinkle: reasoning models change the calculus. The DeepSeek-R1 technical report found that few-shot prompting consistently degraded R1's performance, recommending zero-shot prompts that simply describe the problem — the examples seem to interfere with the model's own reasoning trace. Models that deliberate internally need the goal, not a demonstration. If you're working with reasoning models, start zero-shot and treat examples as a last resort; for older instruct models, the classic move of demonstrating worked solutions overlaps heavily with chain-of-thought prompting, where the example shows the steps rather than just the answer.
FAQ
What does "shot" mean in zero-shot and few-shot prompting?
A "shot" is one worked example included in the prompt. Zero-shot means no examples (instructions only), one-shot means exactly one example, and few-shot means a handful — typically two to eight input→output pairs shown before your real input.
Is few-shot prompting the same as fine-tuning?
No. Few-shot prompting puts examples in the prompt at request time — the model's weights never change, and the effect disappears after the call. Fine-tuning permanently updates the model's weights using a training dataset. Few-shot is instant and reversible; fine-tuning is durable but requires data, time, and money.
How many examples should a few-shot prompt include?
Three to five is the standard working range — enough to establish the pattern and cover an edge case, before token cost and diminishing returns dominate. Quality and coverage beat quantity: one well-chosen borderline example is worth more than three near-duplicates.
What is the difference between one-shot and few-shot prompting?
Only the count. One-shot includes a single example, which anchors output format but risks the model overfitting to that one demonstration. Few-shot includes several, letting you show variation, edge cases, and the boundaries between categories — which generally makes behavior more robust.
Does few-shot prompting work with reasoning models?
Often it backfires. The DeepSeek-R1 technical report found few-shot prompting consistently degraded its performance, and recommended plain zero-shot problem descriptions. Reasoning models generate their own internal deliberation, and prompt examples can interfere with it — describe the goal clearly and let the model work.
Where do few-shot examples go — system prompt or user messages?
Either works. You can write examples as fake user/assistant message pairs before the real input, or inline them in a single prompt (system or user) wrapped in delimiters like <example> tags. Message pairs anchor format strongly; inline tagged examples keep the prompt in one reusable string. Pick one style and delimit clearly either way.