What Are AI Scaling Laws? Why Bigger Models Got Smarter


Understand the empirical laws that told labs how big to build — and why more data, parameters, and compute keep paying off predictably.

INTERMEDIATE | 11 MIN READ | UPDATED 2026-06-12

In plain English

A scaling law is a discovered pattern that says: if you make an AI model bigger, feed it more data, and spend more computing power training it, the model gets predictably better — and you can chart exactly how much better before you spend a dollar.

Here is the surprising part. The improvement isn't random or lucky. When researchers plotted model quality against scale on logarithmic axes, they got a smooth, ruler-straight line, the kind of clean trend you almost never see in messy real-world data. That straight line is the scaling law. It turned "how good will our next model be?" from a guess into a forecast.

An everyday analogy: think of baking bread at different scales. A home baker who doubles the flour, doubles the yeast, and uses a bigger oven gets a loaf that is reliably better and bigger — not double the quality, but a steady, knowable improvement. Scaling laws are the recipe card that tells AI labs the exact ratio of flour (data) to oven size (compute) to dough (model size) for the best loaf at any budget. Get the ratio wrong and you waste ingredients.

Why it matters

Training a frontier model is one of the most expensive engineering projects on Earth. As of mid-2026, a single frontier training run can cost in the hundreds of millions of dollars, tie up tens of thousands of GPUs, and run for months. You don't get a do-over. Before scaling laws, deciding how big to build a model was educated guesswork — and a wrong guess burned a fortune.

Scaling laws solved the planning problem. A lab can train a handful of small, cheap models, measure how their quality improves with scale, fit a curve, then extrapolate to predict how good a 100x-larger model will be — before committing the budget. That single capability is why the AI race looks the way it does: labs raise billions, build giant data centres, and confidently announce the next model will be smarter, because the math says so.

Investors get a forecast: more compute reliably buys more capability, which justifies the spend.
Researchers get a yardstick: a new technique is only interesting if it beats the scaling-law line, not just the old model.
Engineers get a budget allocator: scaling laws say exactly how to split a fixed compute budget between a bigger model and more training data.
You, the user, get the payoff: every jump from one model generation to the next is, in large part, a scaling law being cashed in.

Scaling is also why models are so hungry for hardware. The compute these laws demand is what makes GPUs essential to LLMs — the curve only bends in your favour if you can actually buy the floating-point operations to ride it.

How it works

Scaling laws connect three knobs to one outcome. The three knobs are N (the number of parameters — the model's size), D (the dataset size, measured in tokens), and C (the total compute spent training, roughly proportional to N × D). The outcome is loss — a single number measuring how wrong the model's next-token guesses are. Lower loss is better.

// Three knobs feed one number

Parameters (N): model size
Data (D): training tokens
Compute (C): ~ N x D
Pre-training loss: lower = better

The landmark 2020 paper Scaling Laws for Neural Language Models by Kaplan and colleagues at OpenAI found that loss falls as a power law in each knob. In plain terms: every time you multiply a knob by 10, loss shrinks by a fixed, predictable fraction, as long as the other knobs are not the bottleneck. Plot it on a log-log chart and you get a straight line, with some trends holding over more than seven orders of magnitude of scale, an astonishingly clean result for empirical science.
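The straight line on a log-log chart is what makes forecasting possible: fit a line to a few small runs, then read off a much larger one. The snippet below is a minimal sketch of that workflow, not any lab's actual pipeline. The "small runs" are synthetic points generated from a made-up law (L = 10 × N^-0.07), and `fit_power_law` is a hypothetical helper written for this example.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares line through (log N, log L).

    Returns (a, alpha) such that L ~= a * N**(-alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Hypothetical small runs: synthetic, noise-free losses from L = 10 * N**-0.07.
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [10.0 * n ** -0.07 for n in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)

# Extrapolate 100x beyond the largest run that was "trained".
predicted = a * 1e11 ** -alpha
print(f"fitted a={a:.3f}, alpha={alpha:.3f}")
print(f"predicted loss at 1e11 parameters: {predicted:.3f}")
```

Real measurements are noisy and include an irreducible floor, so a real fit needs more points and a richer functional form than this two-parameter line.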

A power law looks like this. Don't panic at the symbols — read it as a sentence below.

the scaling law, schematically

Loss(N, D)  =   A / N^alpha   +   B / D^beta   +   E
                  |               |              |
                  |               |              \__ irreducible error
                  |               \_________________ data term
                  \_________________________________ model-size term

# alpha, beta  : how fast loss falls as you grow N or D (the slopes)
# A, B, E      : constants fitted from small experiments
# E            : a floor you can never beat (the entropy of language itself)

Read it as: loss shrinks if you grow the model (the N term), or grow the data (the D term), but it can never drop below a floor E — the irreducible randomness in language no model can predict away. The exponents alpha and beta are measured, not assumed, and they are small (well under 1), which is why each 10x of scale buys a steady-but-shrinking improvement rather than a miracle.
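To see the "steady-but-shrinking" behaviour in numbers, the formula can be evaluated directly. This is a sketch under stated assumptions: the default constants are illustrative values close to those reported for the Chinchilla parametric fit and should be treated as assumptions, not as authoritative numbers, and `parametric_loss` is a name invented for this example.

```python
def parametric_loss(n_params, n_tokens,
                    A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28):
    """Loss(N, D) = A / N^alpha + B / D^beta + E.

    The default constants are illustrative assumptions, not exact values.
    """
    return A / n_params ** alpha + B / n_tokens ** beta + E

# Hold data fixed and grow the model 10x at a time.
tokens = 1e12
sizes = [1e8, 1e9, 1e10, 1e11]
losses = [parametric_loss(n, tokens) for n in sizes]

# Improvement bought by each successive 10x of model size.
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]

for n, loss in zip(sizes, losses):
    print(f"N = {n:.0e}: loss = {loss:.3f}")
for step, gain in enumerate(gains, start=1):
    print(f"10x step {step}: loss falls by {gain:.3f}")
```

Each 10x step still lowers the loss, but by less than the step before, and no amount of scale takes the result below E.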

Crucially, the law also tells you how to split a budget. If you have a fixed amount of compute C, there's a single best way to divide it between making the model bigger (N) and training it longer on more data (D). Spend too much on size and you starve it of data; spend too much on data and the model is too small to absorb it. The sweet spot is the compute-optimal point — and finding it was the whole story of the next big paper.
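For readers who want the algebra, the sweet spot can be derived from the schematic loss formula. The derivation below is a sketch that assumes the common approximation C ≈ 6ND for a dense transformer; the symbols N*, D*, and G are introduced here for convenience.

```latex
% Minimise Loss(N, D) = A N^{-\alpha} + B D^{-\beta} + E
% subject to the (assumed) budget constraint C = 6 N D.
\begin{aligned}
D &= \frac{C}{6N}
  \quad\Rightarrow\quad
  L(N) = A N^{-\alpha} + B \left(\frac{6N}{C}\right)^{\beta} + E \\[4pt]
\frac{dL}{dN} &= -\alpha A N^{-\alpha-1}
  + \beta B \left(\frac{6}{C}\right)^{\beta} N^{\beta-1} = 0 \\[4pt]
N^{\alpha+\beta} &= \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta} \\[4pt]
N^{*} &= G \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},
\qquad
D^{*} = \frac{C}{6N^{*}} = G^{-1}\left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}},
\qquad
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
\end{aligned}
```

If alpha and beta are close to each other, both exponents sit near 1/2, so the best model size and the best dataset size grow at similar rates as the budget grows.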

The Chinchilla correction

For a couple of years after Kaplan's paper, labs read the law one way: make the model as big as possible. Models ballooned — GPT-3 hit 175 billion parameters, DeepMind's Gopher hit 280 billion — but they were trained on relatively modest amounts of data. Everyone over-invested in size.

In 2022, DeepMind's paper Training Compute-Optimal Large Language Models — universally called the Chinchilla paper — showed this was a mistake. The team trained over 400 models at different size/data splits and found that, for a fixed compute budget, model size and data should grow in roughly equal proportion. Their rule of thumb: about 20 training tokens per parameter.

// Same compute, two strategies

Pre-Chinchilla (Gopher): 280B parameters, ~300B tokens, ~1 token per parameter. Big model, starved of data. Loses the benchmark.

Chinchilla: 70B parameters, 1.4T tokens, ~20 tokens per parameter. Smaller model, far more data. Wins on the same training compute with 4x fewer parameters.

The proof was dramatic. Chinchilla had only 70 billion parameters — a quarter the size of the 280B Gopher — yet beat Gopher, GPT-3, and every comparable model across a wide range of tasks, reaching a state-of-the-art ~67.5% on the MMLU knowledge benchmark. A smaller model won simply because the compute was split correctly. That counter-intuitive result reshaped how every lab budgets a training run.
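The "same compute" claim can be checked on a napkin. The sketch below assumes the common estimate C ≈ 6 × N × D for dense transformers and the 20-tokens-per-parameter rule of thumb; `compute_optimal_split` is a hypothetical helper that inverts those two relations for a given budget.

```python
import math

def training_flops(n_params, n_tokens):
    """Rough training cost of a dense transformer: C ~= 6 * N * D (assumed)."""
    return 6 * n_params * n_tokens

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Parameter and token counts as quoted in the comparison above.
gopher_flops = training_flops(280e9, 300e9)
chinchilla_flops = training_flops(70e9, 1.4e12)
print(f"Gopher:     {gopher_flops:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs")

# What the 20:1 rule would have suggested for Gopher's budget.
n_opt, d_opt = compute_optimal_split(gopher_flops)
print(f"20:1 split of Gopher's budget: "
      f"{n_opt / 1e9:.0f}B params, {d_opt / 1e12:.2f}T tokens")
```

Under these assumptions the two budgets land within about 20% of each other, and the 20:1 split of Gopher's budget comes out close to Chinchilla's actual shape.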

Doing the math yourself

You don't need a supercomputer to play with these numbers. Two back-of-the-envelope formulas capture most of the intuition: the Chinchilla data target (20 tokens per parameter) and the classic compute estimate (C ≈ 6 × N × D floating-point operations for a dense transformer).

scaling_napkin_math.py

def chinchilla_optimal_tokens(params: int) -> int:
    """Compute-optimal training tokens, ~20 per parameter."""
    return 20 * params

def training_flops(params: int, tokens: int) -> int:
    """Rough FLOPs to train a dense transformer: C ~= 6 * N * D."""
    return 6 * params * tokens

# A compute-optimal 8B model
N = 8_000_000_000
D_opt = chinchilla_optimal_tokens(N)        # 160 billion tokens
print(f"Chinchilla-optimal data: {D_opt/1e9:.0f}B tokens")
print(f"Training cost: {training_flops(N, D_opt):.2e} FLOPs")

# What labs ACTUALLY ship for inference efficiency: way more data
D_real = 15_000_000_000_000                  # 15 trillion tokens
ratio = D_real / N
print(f"Real ratio: {ratio:.0f} tokens/param  (~{ratio/20:.0f}x past optimal)")

Run that and you'll see the gap between theory and shipping practice. The compute-optimal 8B model wants ~160 billion tokens. Yet Meta's Llama 3 8B was trained on 15 trillion tokens — roughly 1,875 tokens per parameter, almost 100x past the Chinchilla point. Why would they 'waste' all that training compute?

Because Chinchilla optimises the training bill, but ignores the inference bill. A widely-used small model is run billions of times after training, and a smaller model is cheaper to run every single time. So labs happily spend extra training compute to push more knowledge into a smaller body, making the model cheaper to serve forever after. This trade-off is exactly why a tiny open model can feel shockingly capable — it was deliberately overtrained.
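The trade-off can be put in rough numbers by adding a serving term to the training estimate. This is a sketch with loudly stated assumptions: training costs about 6 × N × D FLOPs, serving costs about 2 × N FLOPs per token (a common forward-pass approximation), and the two models compared are assumed, purely for illustration, to reach the same quality. The function names are hypothetical.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training (~6*N*D) plus inference (~2*N per served token). Both assumed."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

def break_even_served_tokens(small, large):
    """Served tokens at which the small model's extra training is repaid.

    `small` and `large` are (n_params, train_tokens) pairs.
    """
    (n_s, d_s), (n_l, d_l) = small, large
    extra_training = 6 * (n_s * d_s - n_l * d_l)
    saving_per_token = 2 * (n_l - n_s)
    return extra_training / saving_per_token

# Hypothetical pairing: an overtrained 8B model vs a 70B model at ~20:1.
small = (8e9, 15e12)
large = (70e9, 1.4e12)

tokens_needed = break_even_served_tokens(small, large)
print(f"Break-even after ~{tokens_needed:.2e} served tokens")
```

With these illustrative inputs the break-even point is on the order of a trillion served tokens; every token served beyond that makes the smaller, overtrained model the cheaper choice overall.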

Scaling in mid-2026: new axes

The original scaling laws were about pre-training. As of mid-2026, the picture has gotten richer, because two of the three knobs are running into walls — and a brand-new knob has appeared.

The data wall

The data knob (D) keeps demanding more tokens, but the supply of high-quality human text is finite. Multiple analyses put the exhaustion of the public high-quality text web somewhere in the 2026–2028 window at current consumption rates. Labs are responding with synthetic data, multimodal data (images, audio, video), and heavier filtering, but "just add more web text" is no longer a free lever.

The new axis: test-time compute

The biggest shift since 2024 is that compute spent while answering, not just while training, also follows a scaling law. Starting with OpenAI's o1 and accelerated by DeepSeek's R1 in early 2025, reasoning models were shown to get better at hard problems the longer they're allowed to "think" (generate a long internal chain of reasoning) before answering. DeepSeek's R1-Zero, for instance, lifted its score on the AIME math contest from the teens to the low 70s as reinforcement learning taught it to reason at greater length. This is test-time (or inference-time) scaling.

// Three scaling axes you can buy capability with

Bigger models: more parameters (N)
More data: more tokens (D)
More thinking: test-time compute

As of mid-2026, the frontier reflects all three axes. Models like Claude Opus 4.x, GPT-5.x, and Gemini 3.x ship with extended-thinking modes that spend more inference compute on demand — letting you trade latency and cost for accuracy at runtime. The headline benchmark gains of the last year came less from raw bigger-is-better pre-training and more from this new reasoning axis.

Going deeper

Once you're comfortable with the basics, here's the more nuanced reality that practitioners argue about.

The exponents were not gospel

In 2024, Epoch AI published a careful replication attempt of the Chinchilla paper. Working from data reconstructed from the paper's plots, they found that the scaling-law parameters the original authors reported fit that data poorly and came with implausibly tight confidence intervals. The corrected fit still landed near the practical ~20 tokens-per-parameter rule, so the headline survived, but it's a reminder that these are empirical curves fitted to noisy experiments, not laws of physics. Treat the exact exponents as best estimates, not constants of nature.

Architecture changes the constants

Scaling laws are usually derived for dense transformers, where every parameter fires on every token. Mixture-of-Experts models break that assumption: only a fraction of parameters activate per token, so a model can have a huge total parameter count but a small active count, shifting the cost curves dramatically. See Mixture-of-Experts explained for why most frontier models in 2026 are sparse, and how that changes the scaling arithmetic. Efficiency tricks like FlashAttention similarly bend the constant factors without changing the underlying power-law shape.
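A rough way to see the shift is to redo the compute estimate with active rather than total parameters. The sketch below assumes that training cost is driven by the parameters that fire per token, C ≈ 6 × N_active × D, which is a simplification that ignores routing overhead and the memory needed to hold every expert. The model sizes are invented for illustration.

```python
def training_flops(params_used_per_token, n_tokens):
    """Rough training cost: ~6 FLOPs per token per parameter used (assumed)."""
    return 6 * params_used_per_token * n_tokens

# Hypothetical sparse model: 400B total parameters, 40B active per token.
total_params = 400e9
active_params = 40e9
tokens = 10e12

dense_cost = training_flops(total_params, tokens)   # if every parameter fired
sparse_cost = training_flops(active_params, tokens)  # only active ones fire

print(f"Dense-equivalent cost: {dense_cost:.2e} FLOPs")
print(f"Sparse (MoE) cost:     {sparse_cost:.2e} FLOPs")
print(f"Compute saved:         {dense_cost / sparse_cost:.0f}x")
```

The total parameter count still matters for memory and serving hardware, which is one reason dense-model scaling constants do not transfer directly to sparse models.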

Loss is not the same as usefulness

Scaling laws predict next-token loss with eerie precision. But the things people actually care about — reasoning, factual accuracy, refusing to hallucinate — relate to loss only loosely and sometimes appear as sudden jumps ("emergent abilities") rather than smooth curves. A model can sit exactly on its predicted loss line and still surprise you, for better or worse, on a downstream task. The line tells you the model got better at predicting text; translating that into better at your job still takes evaluation, fine-tuning, and good prompting.

FAQ

What is the Chinchilla scaling law in simple terms?

It's DeepMind's 2022 finding that, for a fixed amount of training compute, you should grow the model's size and its training data in roughly equal proportion — about 20 training tokens for every parameter. Their 70B-parameter Chinchilla model, trained on 1.4 trillion tokens, beat models four times its size that had been trained on too little data.

Why are bigger AI models better?

Because of scaling laws: empirically, a model's prediction error falls as a smooth power law as you add parameters, data, and compute. More parameters give the model more capacity to store patterns, and the relationship is predictable enough that labs can forecast how good a larger model will be before training it. The catch is diminishing returns: each 10x of scale buys a steady but shrinking improvement, and the error never reaches zero.

Are scaling laws still true in 2026, or have they hit a wall?

Pre-training scaling still works, but two of its inputs are getting expensive: high-quality public text is projected to run low around 2026-2028, and compute costs are enormous. The big shift is a third axis — test-time (inference) compute, where reasoning models like the o1 and DeepSeek R1 lineages get better the longer they think. As of mid-2026, frontier gains lean heavily on this reasoning axis rather than purely bigger pre-training.

What is the difference between Kaplan and Chinchilla scaling laws?

Kaplan et al. (2020) established that loss follows clean power laws in model size, data, and compute. Chinchilla (Hoffmann et al., 2022) corrected the budget split: Kaplan-era practice over-invested in giant models with too little data, while Chinchilla showed model size and data should scale together at roughly 20 tokens per parameter for compute-optimal training.

Why was Llama 3 trained on far more than 20 tokens per parameter?

The 20:1 Chinchilla ratio minimises training cost only. Llama 3's 8B model was trained on ~15 trillion tokens (nearly 2,000 tokens per parameter) because a small, heavily-trained model is cheaper to run for the billions of inferences that follow. Labs trade extra one-time training compute for permanently cheaper inference — sometimes called inference-optimal training.
