Lilian Weng · 2026-06-24 · notable

Lilian Weng: 'Scaling Laws, Carefully' — first new Lil'Log post in 13 months

Lilian Weng walks through scaling laws end-to-end: why Kaplan and Chinchilla reached opposite conclusions, how parameter counting and fit region change the answer, and what data scarcity means for the curves now.

Loss vs compute power-law chart from Kaplan et al., reproduced in Lilian Weng's scaling-laws post.

Lilian Weng returns to Lil'Log after 13 months with a 25-minute walkthrough of scaling laws, Kaplan vs. Chinchilla, and how easily the curves mislead.

What is it?

Lilian Weng's first new Lil'Log post since May 2025 is a long-form survey of neural scaling laws — the empirical curves that predict how training loss falls as model size, dataset size, and compute go up. It covers what the laws actually predict, where they disagree, and what to be careful about before extrapolating.
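To make "careful about extrapolating" concrete, here is a minimal sketch (not taken from the post) that fits a straight line in log-log space to two different regions of the same synthetic loss curve. The curve shape, a constant floor plus a power-law term, and the constants FLOOR, A, and ALPHA are all made up for illustration. The point is only that the fitted slope depends on which region is fit.

```python
import math

def fit_power_law(xs, ys):
    """Ordinary least-squares line through (log10 x, log10 y).

    Returns (slope, intercept); the slope is the fitted power-law exponent.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic curve with hypothetical constants: loss = FLOOR + A * C^(-ALPHA).
FLOOR, A, ALPHA = 1.7, 30.0, 0.1
compute = [10.0 ** e for e in range(15, 25)]   # 1e15 .. 1e24 FLOPs
loss = [FLOOR + A * c ** -ALPHA for c in compute]

# Fit the same curve on its low-compute half and its high-compute half.
slope_small, _ = fit_power_law(compute[:5], loss[:5])
slope_large, _ = fit_power_law(compute[5:], loss[5:])

print(f"slope fitted on 1e15..1e19: {slope_small:.4f}")
print(f"slope fitted on 1e20..1e24: {slope_large:.4f}")
```

Both fits come from one underlying curve, yet the low-compute fit is noticeably steeper because the constant floor matters less there. Extrapolating either line far outside its fit region would give a different forecast.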

How does it work?

The post traces the lineage from early work by Hestness et al. and Rosenfeld et al. through Kaplan et al. 2020, which concluded that model size should grow far faster than data (N_opt proportional to C^0.73), to Chinchilla 2022, which concluded that model size and data should scale together (N_opt proportional to C^0.5). Lilian Weng shows that the gap comes mostly from two choices: how embedding parameters are counted and which region of the loss curve is fit. A small difference in a log-log slope compounds into a large difference in the recommended model size once it is extrapolated to trillion-token scale.
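A rough sketch of how far the two exponents drift apart under extrapolation. Only the exponents 0.73 and 0.5 come from the summary above. The reference point (1e21 FLOPs, 1e9 parameters) is a hypothetical anchor at which both rules are assumed to agree, and the C ≈ 6·N·D relation used to back out token counts is a common approximation brought in here as an assumption, not something attributed to the post.

```python
KAPLAN_EXP = 0.73       # N_opt proportional to C^0.73, as quoted above
CHINCHILLA_EXP = 0.5    # N_opt proportional to C^0.5, as quoted above

# Hypothetical anchor: assume both rules recommend the same model here.
REF_COMPUTE = 1e21      # FLOPs
REF_PARAMS = 1e9        # parameters

def optimal_params(compute, exponent):
    """Extrapolate N_opt from the anchor using N_opt proportional to C^exponent."""
    return REF_PARAMS * (compute / REF_COMPUTE) ** exponent

def implied_tokens(compute, params):
    """Tokens implied by the assumed approximation C = 6 * N * D."""
    return compute / (6.0 * params)

def compare(compute):
    """Return the parameter and token counts each rule implies at `compute`."""
    result = {}
    for name, exponent in (("kaplan", KAPLAN_EXP), ("chinchilla", CHINCHILLA_EXP)):
        n = optimal_params(compute, exponent)
        result[name] = {"params": n, "tokens": implied_tokens(compute, n)}
    return result

for budget in (1e21, 1e24):
    rows = compare(budget)
    for name, row in rows.items():
        print(f"C={budget:.0e} {name:>10}: "
              f"N={row['params']:.3e} D={row['tokens']:.3e}")
```

Under these assumptions, a 1000x increase in compute leaves the Kaplan-style recommendation with roughly five times as many parameters as the Chinchilla-style one (a factor of 1000^0.23), and correspondingly fewer training tokens.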

Why does it matter?

Compute-optimal allocation decides whether to spend the next training run on a bigger model or more tokens, and that single call swings cost by orders of magnitude. The post also covers data-limited regimes with repeated tokens, where naive scaling laws break down — directly relevant now that frontier labs are running out of fresh web text.

Who is it for?

ML researchers, pretraining engineers, and anyone reasoning about the next compute budget.

