What Are Emergent Abilities in LLMs? When Scale Unlocks New Skills

See why certain abilities seem to switch on only once a model is big enough, and why scientists argue about whether that's real or a measurement artifact.

INTERMEDIATE · 11 MIN READ · UPDATED 2026-06-13

In plain English

An emergent ability is a skill that a large language model does not have when it is small, and then suddenly does have once it grows past a certain size. Tiny and medium models score around chance on the task — basically guessing — and then, somewhere along the size curve, performance jumps from "useless" to "surprisingly good." The skill was never explicitly taught; it appears to switch on as a side effect of scale.

Think of learning a foreign language. For months you memorize words and grammar rules and you still can't follow a real conversation — your progress feels flat and discouraging. Then one week something clicks, and you can suddenly understand a spoken dialogue you would have been lost in before. You didn't learn one magic rule on that day; thousands of small pieces finally crossed a threshold where they work together. Emergent abilities in LLMs feel like that: a capability that seems absent, absent, absent — then present.

The classic examples are multi-step arithmetic, following instructions phrased in plain English, answering with step-by-step reasoning, and translating between languages the model was barely shown. In the reported results, a model with a few hundred million parameters flatly fails these, while a model a hundred times bigger, trained the same way, handles them.

Why it matters

Emergence matters because it makes large models unpredictable in a specific, important way — and unpredictability is hard to plan, budget, and govern around.

You can't always predict a skill before you pay for the model. Scaling laws let you forecast a model's loss (its raw prediction error) very smoothly as you add data and compute. But a smooth loss curve does not tell you which concrete tasks a model will be able to do. A capability you care about — say, reliable multi-step math — might only appear after you've already spent the money to train the bigger model.
It reframes what "just make it bigger" buys you. If new abilities really do unlock at scale, then scaling isn't only making the model a little better at what it already did — it can hand you qualitatively new behaviors. That argument fueled a lot of the race to train ever-larger models.
It complicates safety and evaluation. A capability that was absent in every model you tested can show up in the next, larger one. If some of those surprise capabilities are risky (better persuasion, better tool use, better code), you'd like to know before deployment — and emergence says you might not.
It's genuinely contested. A 2023 rebuttal argued much of the "sudden" emergence is an illusion created by how we measure, not a real phase change in the model. Builders need to understand both sides, because the answer changes how much you trust a scary-or-exciting demo.

So whether you're forecasting a training budget, designing an evaluation suite, or just trying to understand why models keep getting suddenly good at new things, emergence is the phenomenon sitting underneath the headlines about scale.

How it works

Nothing in training is changed to produce an emergent ability. The recipe is the same at every size: predict the next token over a huge pile of text. What changes is scale — usually measured as the training compute (a combination of model size in parameters and the amount of training data). You plot a task's score against scale and look at the shape of the curve.

For most things, that curve is smooth: bigger models are gradually, predictably better. The interesting case is when the curve is flat at chance level across many model sizes and then bends sharply upward past some point. That sharp bend is what gets called emergence.

// The emergence pattern, step by step

Same recipe: next-token prediction
Scale up: more compute + data
Flat at chance: small models fail the task
Threshold: a critical scale
Sharp jump: skill appears
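
Once you have scores at several scales, the flat-then-jump shape can be located mechanically. Here is a minimal sketch in Python. The scores, the chance level, and the margin are invented for illustration, and `first_scale_above_chance` is a hypothetical helper, not part of any evaluation library.

```python
from typing import Optional, Sequence, Tuple


def first_scale_above_chance(
    points: Sequence[Tuple[float, float]],
    chance: float,
    margin: float = 0.1,
) -> Optional[float]:
    """Return the smallest scale whose score beats chance by more than `margin`.

    `points` holds (scale, score) pairs. Returns None if no point qualifies.
    """
    for scale, score in sorted(points):
        if score > chance + margin:
            return scale
    return None


# Hypothetical 4-way multiple-choice task, so chance is 0.25.
# Scale is training compute in FLOPs; every number here is made up.
curve = [
    (1e20, 0.24),
    (1e21, 0.26),
    (1e22, 0.25),
    (1e23, 0.27),
    (1e24, 0.61),
]

threshold = first_scale_above_chance(curve, chance=0.25)
print(threshold)
```

The margin is a judgment call: too small and noise around chance triggers it, too large and the threshold is reported late.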

Why might a skill stay hidden, then appear?

One intuitive story: many useful tasks are compositions of sub-skills. To do a two-step word problem, the model needs to parse the question, recall the right operation, and carry out the arithmetic without slipping — all in one pass. If each sub-skill has only a moderate chance of firing correctly, the combined task fails almost every time until each piece becomes reliable. Then, once all the pieces cross their own quiet thresholds, the whole chain succeeds, and from the outside it looks like one ability switched on at once.
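
The composition story can be made concrete with a few lines of arithmetic. The sketch below assumes the sub-skills fail independently of each other, which is a simplification, and the reliabilities are invented. `chain_success` is a hypothetical helper name.

```python
from math import prod


def chain_success(sub_skill_probs):
    """Probability that every sub-skill succeeds, assuming independent failures."""
    return prod(sub_skill_probs)


# Three sub-skills (parse, pick the operation, compute), each improving steadily.
# The per-skill reliabilities are illustrative, not measured.
for p in (0.3, 0.5, 0.7, 0.9, 0.99):
    print(f"per-skill {p:.2f} -> whole task {chain_success([p] * 3):.3f}")
```

With three sub-skills at 50% each, the whole task succeeds only 12.5% of the time; at 90% each it succeeds about 73% of the time. The whole-task curve lags the per-skill curve and then catches up quickly, and longer chains make that bend sharper.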

A second story is simply that bigger models form richer internal representations — more of the world's structure gets compressed into the weights — and some tasks need a certain richness before they're solvable at all. Below that richness the model has no foothold; above it, the task becomes learnable from the same data it always saw.

Concrete examples of emergent abilities

The 2022 paper that popularized the term, "Emergent Abilities of Large Language Models" (Wei et al.), and its follow-ups collected dozens of tasks that show this flat-then-jump shape. A few that are easy to picture:

Ability	What small models do	What appears at scale
Multi-step arithmetic	Get multi-digit addition or word problems wrong almost every time	Carry digits and chain steps reliably enough to be useful
Instruction following	Ignore or misread plain-English task descriptions	Actually do what a freshly-phrased instruction asks, even unseen ones
Chain-of-thought reasoning	Get worse if asked to 'think step by step' — the steps are noise	Get better when prompted to reason out loud before answering
Word-in-context / wordplay	Score around random guessing	Disambiguate meaning, handle puns and analogies

Chain-of-thought is the most striking one. On a small model, asking it to write out its reasoning hurts: it just generates more wrong text. On a large enough model, the same prompt helps, sometimes dramatically. So even a prompting technique can behave as if it emerges: it hurts or does nothing at small sizes, then unlocks. This is closely tied to why instruction-tuned chat models (see base vs instruct models) feel so much more capable than their size alone suggests, and it's a big part of how ChatGPT works so well in conversation.

The 'is it a mirage?' debate

Here's where it gets genuinely interesting, and where you should hold the concept loosely. In 2023, a paper titled "Are Emergent Abilities of Large Language Models a Mirage?" (Schaeffer et al.) argued that many "sudden" jumps are an artifact of the metric, not a real phase change in the model.

The core insight is about harsh, all-or-nothing scoring. Take multi-step arithmetic graded by exact-match accuracy: the answer is only "right" if every digit is correct. A model that gets steadily, smoothly better at producing the right digits will still score near zero on exact-match until it's good enough to nail the whole number, at which point its score leaps from near zero to high. The underlying skill improved smoothly; the metric made it look like a cliff.

// Two readings of the same models

Emergence view

Skill is absent, then present
A real threshold in the model
Sudden, hard to predict
Bigger models gain new abilities
Measured with strict exact-match metrics

Mirage view

Underlying skill improves smoothly
The jump lives in the metric, not the model
Use a softer metric and the cliff vanishes
Bigger models get steadily, predictably better
Measured with partial-credit / continuous metrics

The evidence: when researchers re-scored those same tasks with continuous, partial-credit metrics — for example, distance from the correct number, or per-token correctness instead of exact-match — many of the dramatic cliffs flattened into smooth, predictable curves. That strongly suggests at least some reported emergence was measurement, not magic.
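
To see how the two scoring styles diverge on the same outputs, here is a small sketch. The target and the five answers are hypothetical stand-ins for models of increasing size, and `digit_accuracy` is one simple partial-credit metric chosen for illustration, not the specific metric used in the paper.

```python
def exact_match(prediction: str, target: str) -> float:
    """All-or-nothing: 1.0 only if the whole string matches."""
    return 1.0 if prediction == target else 0.0


def digit_accuracy(prediction: str, target: str) -> float:
    """Partial credit: fraction of target positions where the digit matches.

    Positional comparison only; extra characters in a longer prediction
    are ignored in this simple version.
    """
    hits = sum(p == t for p, t in zip(prediction, target))
    return hits / len(target)


target = "4172"
# Hypothetical answers from five models, smallest to largest.
answers = ["9305", "4305", "4105", "4175", "4172"]

for answer in answers:
    print(answer, exact_match(answer, target), digit_accuracy(answer, target))
```

On these made-up answers, exact-match stays at 0 for the first four models and jumps to 1 on the last, while digit accuracy climbs in even steps of 0.25. Same outputs, two very different-looking curves.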

Does that kill the idea? Not entirely. Even if the cause of the jump is the metric, the practical reality often still bites: builders frequently care about the exact-match outcome ("did it produce the correct invoice total or not?"), and on that metric the capability really is unusable below a size and usable above it. The fair summary: emergence is real as a user-facing phenomenon under strict metrics, but it is often not a mysterious phase change inside the model. Treat a claimed emergent ability as a question — which metric, and does it survive a smoother one? — not a settled fact.

What this means for builders

Don't extrapolate a skill from a small model. If a 1B-parameter model can't do your task at all, that is weak evidence about a 100B one. The flat-then-jump shape means small-model failure doesn't predict large-model failure.
Choose metrics that show partial progress. If you only track exact-match, you'll see a cliff and won't know whether you're 5% or 95% of the way there. Add a continuous or partial-credit metric so you can watch a capability approach before it lands. This is the single most useful takeaway from the mirage debate.
Re-test new skills on every model upgrade. A capability absent in your current model may appear in the next, larger one — sometimes a useful one, sometimes a risky one. Bake a capability sweep into your evaluation routine rather than assuming last quarter's limits still hold.
Be skeptical of single-demo 'emergence' claims. A viral example of a model 'suddenly' doing something is not evidence of a threshold. Ask whether it holds across many prompts and under a fair metric.
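
A capability sweep across model upgrades can start very small. The sketch below compares per-task scores for two model versions and lists the tasks that crossed a usability bar. The task names, scores, and the 0.8 bar are all hypothetical, and `newly_usable` is an illustrative helper rather than an existing tool.

```python
def newly_usable(old_scores, new_scores, usable_at=0.8):
    """Tasks below `usable_at` on the old model that reach it on the new one."""
    return sorted(
        task
        for task, new in new_scores.items()
        if new >= usable_at and old_scores.get(task, 0.0) < usable_at
    )


# Hypothetical mean scores over many prompts per task, not single demos.
old_model = {"invoice_totals": 0.12, "sql_generation": 0.85, "unit_conversion": 0.40}
new_model = {"invoice_totals": 0.91, "sql_generation": 0.88, "unit_conversion": 0.55}

print(newly_usable(old_model, new_model))
```

Feeding it averaged scores over many prompts, ideally alongside a partial-credit metric, keeps the sweep from reacting to a single lucky output.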

Going deeper

Once the basics click, a few nuances are worth carrying with you.

Emergence is not the same as understanding. A model crossing a threshold on a benchmark hasn't necessarily learned the concept the way a human does — it may have found a shortcut that generalizes only so far. Strong benchmark performance and brittle real-world behavior can coexist, which is exactly why over-reading a single emergent result is dangerous.

The unit of scale matters. People say "bigger model," but training compute folds together parameter count and training tokens. A model can be larger in parameters yet under-trained, and a smaller model trained on far more data can beat it. So when you read that an ability emerges "at scale," check which scale axis — parameters, data, or total compute — the claim is really about. The choice of x-axis can change where (or whether) a jump appears.
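
To see why the axis matters, it helps to put numbers on compute. A common rule of thumb from the scaling-law literature (an outside approximation, not something this article derives) estimates training compute as roughly 6 FLOPs per parameter per training token. The two models below are hypothetical.

```python
def approx_training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# Two hypothetical models on opposite ends of the parameter/data trade-off.
big_undertrained = approx_training_flops(params=10e9, tokens=200e9)
small_well_trained = approx_training_flops(params=2e9, tokens=1e12)

print(f"{big_undertrained:.2e} {small_well_trained:.2e}")
```

Both come out at about 1.2e22 FLOPs. Plotted against compute they sit at the same x position; plotted against parameter count they sit a factor of five apart, which is enough to move an apparent threshold.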

Emergence vs. scaling laws, one more time. These two ideas are constantly conflated, so keep them apart: scaling laws are the smooth, predictable relationship between compute and loss — the part you can extrapolate. Emergence is the claimed surprising, non-smooth appearance of specific task abilities — the part you (allegedly) can't. The mirage critique is essentially the argument that, measured fairly, the second collapses back into the first.
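
The "part you can extrapolate" is easy to sketch. The example below fits a pure power law, L = a * C^(-b), by a straight-line fit in log-log space and extrapolates it. The data is synthetic, generated from a known power law, and real scaling-law fits usually include extra terms such as an irreducible loss, so treat this as a simplified illustration.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares line through (log C, log L); returns (a, b) for L = a * C**(-b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic losses generated from L = 20 * C**(-0.05); not real measurements.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [20.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
predicted = a * 1e23 ** -b  # extrapolate two orders of magnitude past the data
print(round(b, 3), round(predicted, 3))
```

Nothing comparable exists for a task score that sits at chance: a flat line carries no slope to extrapolate, which is why a partial-credit metric is the closest thing to an early warning.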

Where to go next. If you want the foundation, read how LLMs work and what scaling laws are — emergence sits right at the seam between them. For the bigger picture of how today's chat models turn raw scale into useful behavior, how ChatGPT works and base vs instruct models show how training choices, not just size, decide which abilities you actually get to use. The honest state of the field: scale clearly unlocks new useful behavior, but how much of that is a true phase change versus an artifact of measurement is still openly debated — and that humility is the right place to leave it.

FAQ

What are emergent abilities in LLMs?

Emergent abilities are skills that small language models can't do at all but larger ones can — performance stays near random guessing across many model sizes, then jumps sharply once the model passes a certain scale. The skill isn't explicitly taught; it appears as a side effect of training a bigger model the same way. Classic examples include multi-step arithmetic, instruction following, and chain-of-thought reasoning.

Are emergent abilities of large language models real or a mirage?

Both views have evidence. A 2023 paper showed that many sudden 'jumps' are an artifact of harsh, all-or-nothing metrics like exact-match accuracy: re-score the same models with a continuous, partial-credit metric and the cliff often flattens into a smooth curve. So emergence is frequently a measurement effect rather than a mysterious phase change — but on the strict metrics builders actually care about, a capability really can be unusable below a size and usable above it.

What is the difference between emergent abilities and scaling laws?

Scaling laws describe the smooth, predictable way a model's overall loss (prediction error) improves as you add compute and data — you can extrapolate it. Emergence is the claim that specific task abilities sometimes appear suddenly and non-smoothly, in a way the smooth loss curve didn't predict. In short: scaling laws are the predictable part, emergence is the surprising part — and the mirage critique argues much of that surprise vanishes under fairer metrics.

Do bigger models really gain new skills?

On user-facing, strict metrics, yes — there are tasks a small model fails completely that a larger one handles. Whether that's a true internal threshold or just smooth underlying progress crossing a scoring cutoff is debated. Either way, the practical takeaway holds: you often can't predict which concrete tasks a model can do just from a smaller model's performance.

Why does chain-of-thought prompting only help large models?

Chain-of-thought asks the model to reason out loud before answering. On a small model the extra reasoning is mostly noise and can make answers worse; on a large enough model the same prompt improves accuracy, sometimes a lot. That makes chain-of-thought one of the most striking emergent behaviors — even a prompting technique can appear to 'switch on' only past a certain scale.

Does an emergent ability mean the model is reliable at that task?

No. Emergence describes when a skill starts being possible at all, not when it becomes trustworthy. A model that has crossed the threshold on arithmetic still makes mistakes and can fail at things that look trivial to humans. Always evaluate reliability separately from whether the capability has appeared.
