In plain English
Few-shot prompting means slipping a handful of worked examples into your prompt before the real request — input, output, input, output — so the model can imitate the pattern. The name comes from machine-learning jargon: "few" contrasts with "zero" (no examples at all) and "many" (hundreds or thousands).
Think of it like briefing a temp worker. You could say "file these documents alphabetically" (zero-shot) and hope for the best. Or you could show them three finished folders first and then hand over the pile. The examples tell them what done actually looks like far better than any written instruction can.
The immediate practical question is always: how many examples? Too few and the pattern is ambiguous. Too many and you're burning token budget, slowing the call, and — according to a 2025 study on over-prompting — potentially hurting performance. The honest answer is it depends, but this article gives you a principled way to find the right number for your task.
Why it matters
Getting the example count wrong is one of the most common, and least obvious, prompt failures. Use too few examples and the model guesses at edge cases. Use too many and you eat up context budget that could hold user documents, system instructions, or retrieved facts — all of which often matter more than a fifteenth demonstration.
The cost is real. Every additional example adds tokens to every single request you make. On a high-volume endpoint that is a direct latency and cost multiplier. A prompt that runs a million times per day with four unnecessary examples could cost more than fine-tuning the task out of the prompt entirely.
There is also a subtler failure mode: homogeneous examples. If all your examples show the easy, happy-path case, the model will confidently apply that pattern to hard cases where a different behaviour is needed. Four carefully chosen diverse examples almost always outperform eight that all look the same.
Finally, example order is not neutral. Research published in 2025 found that reordering identical few-shot examples could swing classification accuracy from near state-of-the-art to near random. Understanding why lets you control the effect rather than be surprised by it.
How it works
When you provide examples, the model uses them to infer the output format, the tone, the vocabulary, and implicit rules you never stated in the instruction. It reads them as a mini-dataset of (input, correct output) pairs and tries to generalise the mapping to the new input.
Performance vs. example count
Gains are large at first and taper fast. Going from zero to one or two examples produces the biggest jump — the model now has something concrete to imitate. Going from two to four produces a moderate improvement. Beyond six or so, gains are typically marginal for most tasks, and a 2025 study ("The Few-shot Dilemma", arxiv:2509.13196) showed that for some models — including GPT-4o, DeepSeek-V3, and several LLaMA-3 variants — excessive domain-specific examples actually degraded performance. The researchers named this over-prompting.
The exception is the many-shot regime: if you have a very long context window (hundreds of thousands of tokens) and hundreds of high-quality examples, Google DeepMind's 2024 NeurIPS research showed consistent gains continuing well past the few-shot plateau, particularly on hard reasoning and non-natural-language tasks. But this requires a deliberate architectural choice, not just pasting more examples.
The attention-budget view
Models have a finite "attention budget". As you add tokens, each new piece of content gets slightly less focus than if the window were shorter. Examples compete for attention with the actual instruction, the user's query, and any retrieved context you include. The practical implication: once your examples are covering the diversity of the task adequately, adding more examples crowds out more important material.
| Count | Typical gain | When to stop |
|---|---|---|
| 0 → 1 | Very large — model now has a concrete pattern to follow | If one example fully specifies the format, that may be enough |
| 1 → 3 | Large — ambiguity about edge cases starts to resolve | Good default stopping point for simple, well-defined tasks |
| 3 → 6 | Moderate — edge cases, tone variation, hard inputs covered | Aim here for tasks with real input diversity |
| 6 → 10 | Small — usually marginal unless task is highly varied | Stop unless you see clear eval improvements |
| 10+ | Often flat or negative (over-prompting risk) | Only in explicit many-shot setups with very long context windows |
How to pick the right examples
The most important principle is diversity over quantity. Each example should teach the model something the others don't. Before adding an example, ask: does this cover a slice of the input space not already represented?
The coverage checklist
- A simple, canonical case. This anchors the basic pattern — length, format, tone.
- A harder or more complex case. Shows what the model should do when the input is messy or information is incomplete.
- An ambiguous or edge case. Demonstrates the decision rule for inputs that fall in the grey area.
- A negative or rejection case (if relevant). Explicitly showing what the model should not do ("if input is X, output is 'N/A'") prevents confident wrong answers.
- A stylistically varied case. If your real inputs will vary in length, phrasing, or formality, the examples should too.
Similarity-based selection
For tasks where you have a large example bank, retrieval-augmented few-shot selection dynamically picks the examples most similar to the live input using embedding cosine similarity (e.g. with a sentence-transformer or your provider's embedding API). This works well for structured extraction, classification, and translation — the model gets examples that look like the real query rather than generic representatives.
Standard selection methods studied in the over-prompting research include random sampling, semantic embedding similarity, and TF-IDF vector similarity. A combined stratified approach — picking examples that together cover the label distribution — outperformed any single method, especially at avoiding the over-prompting failure mode.
Labelling quality matters more than count
A 2024 Cleanlab study on reliable few-shot prompt selection found that labelling errors in examples are one of the biggest drivers of few-shot underperformance — more impactful, on average, than choosing suboptimal examples. Before you add more examples, check that the ones you have are unambiguously correct and represent the output you actually want.
Why example order matters (and what to do about it)
Order sensitivity is one of the most underappreciated failure modes in few-shot prompting. A February 2025 paper ("The Order Effect", arxiv:2502.04134) confirmed across multiple closed-source models that shuffling the same examples could produce measurable accuracy differences, with the degree of sensitivity varying by model and task.
Primacy and recency bias
Models exhibit both primacy bias (early examples receive more weight) and recency bias (the last example before the live query exerts a disproportionate pull). Different models lean different ways. The practical consequence: if your final example happens to have an unusual characteristic — an exceptionally short output, a rare label, an unusual format — the model may over-index on that pattern.
- For classification tasks: interleave labels rather than grouping all examples of the same class together. Five positive examples followed by five negative examples will bias toward whatever label came last.
- For format-sensitive tasks: put your most representative, typical example last — it anchors the model's immediate working context.
- For high-stakes deployments: test multiple orderings (at least 3–5 permutations) and either pick the best or average results across orderings.
Spotting diminishing returns in practice
The fastest way to find the right count for your specific task is to run an evaluation over a held-out set. Add one example at a time (0, 1, 2, 3, …), measure task accuracy or quality on the eval set, and plot the curve. Diminishing returns appear as a flattening slope; over-prompting appears as a downward tick.
import itertools
from your_llm_client import call_model # replace with your client
examples = [
{"input": "...", "output": "..."},
# add more
]
eval_set = [("...", "...")] # (input, expected_output) pairs
def build_prompt(n_examples: int, input: str) -> str:
shots = "".join(
f"Input: {e['input']}\nOutput: {e['output']}\n\n"
for e in examples[:n_examples]
)
return f"{shots}Input: {input}\nOutput:"
for n in range(0, len(examples) + 1):
correct = sum(
call_model(build_prompt(n, inp)).strip() == exp
for inp, exp in eval_set
)
print(f"n={n}: {correct}/{len(eval_set)} ({100*correct/len(eval_set):.1f}%)")If you don't have a formal eval set yet, a cheaper heuristic: run your prompt on 10–15 representative real inputs at different example counts. If the outputs look the same at N=4 and N=8, stop at N=4. Trust inspection over intuition.
Watch for these over-prompting warning signs in model output: the model starts echoing the format of examples too literally (copying phrasing instead of adapting), it refuses edge cases that examples didn't cover (over-constrained pattern), or accuracy on inputs unlike the examples drops compared to zero-shot (retrieval interference).
Going deeper
Chain-of-thought examples
When your task requires multi-step reasoning, each example should show the reasoning chain, not just the final answer. This is chain-of-thought (CoT) few-shot prompting. The count guidance still applies — 3 to 5 well-chosen CoT examples typically outperform 8 shallow ones — but the quality bar for each example is higher: the reasoning trace must be correct and legible.
Many-shot prompting
Google DeepMind's 2024 NeurIPS paper on many-shot in-context learning showed that scaling to hundreds or thousands of examples continues to improve performance on difficult tasks — but only when you have a truly long context window (128k+ tokens), high-quality examples at scale, and a task that genuinely benefits from coverage breadth (complex classification, rare-language translation, code generation with many idioms). For the vast majority of production tasks, the 3–6 example sweet spot holds.
When to leave few-shot behind
Few-shot prompting is a fast, zero-infrastructure way to communicate a pattern. But if you're consistently needing 10+ examples to get reliable results, or if you're using the same large example set on every single call, you should evaluate fine-tuning: the examples get baked into the weights, your runtime prompt shrinks dramatically, and per-call cost drops. Think of few-shot as a prototype tool — if it works at 4 examples, ship it; if you need 20, it's a signal to invest in fine-tuning.
Format consistency across examples
Whatever format you establish in example one — JSON structure, delimiter choice, capitalisation of labels, line breaks between fields — must be identical in every subsequent example and in the live prompt. Inconsistency in the demonstration set is one of the most reliable ways to produce inconsistent outputs. The model will treat formatting variation as a signal about acceptable variance, then reproduce it unpredictably.
FAQ
How many few-shot examples should I start with?
Three is a good default: one simple case, one harder case, and one edge case. This covers the basic pattern and at least one failure mode without burning significant token budget. Measure quality and add more only if the eval curve is still rising.
Can too many few-shot examples hurt performance?
Yes. A 2025 paper titled "The Few-shot Dilemma" showed that excessive domain-specific examples degraded performance on GPT-4o, DeepSeek-V3, LLaMA-3, and Mistral. The effect (called over-prompting) was worst when all examples were very similar to each other, creating an over-constrained pattern the model couldn't generalise beyond.
Does the order of few-shot examples change the output?
Yes, sometimes significantly. Research published in 2025 found that reordering the same examples could swing classification accuracy from near state-of-the-art to near random on some models. The safe practices are: interleave labels for classification tasks, put your most representative example last, and test multiple orderings during development.
Should I use examples similar to the live input or diverse examples?
Both approaches have merit. Static diverse examples teach the model the full range of the task. Dynamically retrieved similar examples (via embedding similarity) give the model a closer analogy to the specific input at hand. For high-volume pipelines, a hybrid works well: 1–2 static anchor examples plus 2–3 dynamically retrieved ones.
How do I know if I need fine-tuning instead of few-shot examples?
If you consistently need more than 10 examples to get reliable results, or you are passing the same large example block on every single API call, fine-tuning is worth evaluating. The break-even depends on volume: at millions of calls per day, eliminating 8 examples from each prompt can fully pay for a fine-tuning run within days.
Do few-shot examples work the same way on all models?
No. Larger, more capable models (GPT-4-class, Claude 3-class) are more order-robust and generally need fewer examples to catch the pattern. Smaller models are more sensitive to order and typically need more examples to reliably generalise. Always test your example set on the specific model you are deploying.