In plain English
ARC-AGI — the Abstraction and Reasoning Corpus — is a benchmark that tests something most AI tests don't: the ability to figure out a brand-new rule on the spot. Each task shows you a handful of small colored grids. On the left, an input grid; on the right, the output grid it turns into. You get two or three of these example pairs, and then one final input with the output left blank. Your job is to look at the examples, work out what transformation is happening, and produce the missing answer.

Think of those visual puzzle books where you see a sequence of shapes and have to draw what comes next. Nobody teaches you the rule beforehand. You stare at the examples, your brain says "ah, each blue square becomes a red circle, and the whole thing flips," and you apply that idea to the new case. That on-the-fly rule discovery is what psychologists call fluid intelligence — reasoning about a problem you've never seen, without leaning on memorized facts.
Most AI benchmarks measure the opposite: crystallized knowledge. Ask a model the capital of a country or to recall a coding pattern, and it answers from things it absorbed during training. ARC-AGI deliberately avoids that. Every task is designed to be novel, so memorizing the internet doesn't help. You have to actually reason.
Why it matters
For years, the standard way to show a model was smart was to post a high score on a knowledge test like MMLU or a coding test like SWE-bench. The problem is that those tests reward scale. Train a bigger model on more data and the scores climb — partly because the model has genuinely learned more, and partly because more of the test's subject matter has leaked into its training data. ARC-AGI was built to break that pattern.
- It isolates reasoning from memorization. The tasks use simple, universal building blocks — counting, symmetry, filling shapes, moving objects — that a human child grasps. Because each puzzle is unique and not published facts, a model can't have "seen the answer" before. A high score means it actually worked something out.
- It exposed a real gap. For a long stretch, models that crushed graduate-level exams scored near zero on ARC-AGI puzzles that ordinary people solve easily. That contrast was a loud signal: fluency and knowledge are not the same thing as flexible reasoning.
- It resists the usual shortcuts. You can't beat ARC-AGI by adding parameters or scraping more text. That makes it a stress test for whether new techniques — like the reasoning/"thinking" models that work through a problem step by step — deliver genuine generalization or just better pattern-matching.
Why does a builder care about a grid-puzzle game? Because it's a clean, honest yardstick. When you compare models for real work, you want to know whether a model can adapt to your novel situation — not just whether it memorized a popular benchmark. ARC-AGI is one of the few tests where a rising score is hard to fake, which is exactly why it became a reference point the whole field watches. (For the broader picture of what benchmarks can and can't tell you, see what are LLM benchmarks.)
How it works
Every ARC-AGI task follows the same shape. You're given a small number of demonstration pairs — an input grid and the output grid it maps to. From those alone you infer the hidden rule, then apply it to a fresh test input to produce the answer. The grids are small (cells colored from a fixed palette), and the rules are things like "reflect the shape," "recolor by size," "continue the pattern," or "keep only the largest object."
Scoring is strict and binary. The predicted output grid must match the correct one exactly — every cell, the right size, the right colors. There's no partial credit for being close. A system is usually allowed a small number of attempts per task, and a task counts as solved only if one attempt is pixel-perfect. The headline number is simply the fraction of tasks solved.
The split that keeps it honest
ARC-AGI's design fights one of the biggest problems in benchmarking: tasks leaking into training data. It splits its puzzles into several sets, and crucially keeps a private evaluation set that is never published. Models are tested against those hidden tasks, so nobody can train on the answers.
This is what makes ARC-AGI resistant to benchmark contamination — the situation where a model scores high only because the test questions were quietly in its training data. With a private set, a strong score has to come from solving genuinely unseen puzzles.
The ARC Prize
ARC-AGI is paired with the ARC Prize, a public competition (run on the hidden tasks, often under tight compute limits) that offers prizes for systems that pass a high accuracy bar. The compute cap matters: it rewards efficient reasoning over brute-forcing millions of guesses. The competition is what turned ARC-AGI from one researcher's benchmark into a focal point the whole community pushes against year after year.
A worked example
Tasks are stored as plain JSON: each grid is a list of rows, and each cell is a number standing for a color. Here's a simplified, illustrative task in that style — not a real ARC-AGI puzzle, just the format — where the rule is "recolor the single non-zero cell from 1 to 2 and leave everything else alone."
{
"train": [
{ "input": [[0,0,0],[0,1,0],[0,0,0]],
"output": [[0,0,0],[0,2,0],[0,0,0]] },
{ "input": [[1,0,0],[0,0,0],[0,0,0]],
"output": [[2,0,0],[0,0,0],[0,0,0]] }
],
"test": [
{ "input": [[0,0,0],[0,0,0],[0,0,1]] }
]
}A solver reads the two train pairs, notices that wherever there was a 1 the output has a 2 (and nothing else changed), and concludes the rule is "turn the 1 into a 2." Applied to the test input, the answer is the same grid with the bottom-right cell changed to 2. Easy for you — and that's the point. The real puzzles layer up subtler ideas (symmetry, counting, object grouping, gravity), but the loop never changes: read the examples, name the rule, apply it.
Why it stays hard for AI
The puzzles look trivial, so why do powerful models struggle? Because nearly everything that makes modern AI strong on other tests works against it here.
| What helps on most benchmarks | Why it doesn't help on ARC-AGI |
|---|---|
| Memorizing huge amounts of text | Every task is novel, so there's nothing to recall |
| Recognizing familiar patterns | The rule changes from one task to the next |
| Being trained on the benchmark's data | The evaluation set is private and unseen |
| Producing fluent, plausible language | Output must be an exact grid, not prose |
| Scaling up model size | More parameters add knowledge, not on-the-fly reasoning |
Two things make a single puzzle genuinely difficult. First, the rule space is open-ended — a transformation might involve symmetry, counting, color logic, motion, object grouping, or several stacked together, and you only get two or three examples to pin it down. Second, the answer must be exact. A solver that grasps the gist but misplaces one cell scores zero. Humans handle this with what Chollet calls core knowledge: innate priors about objects, geometry, counting, and space that we bring to any new scene. Models have to reconstruct that intuition from scratch for every task.
Progress has been real — newer reasoning models that deliberate step by step and search over candidate rules do far better than older ones, and the original ARC-AGI-1 has seen strong systems emerge. But the follow-up ARC-AGI-2 was built precisely because the first set was no longer hard enough, and it remains far from solved while staying easy for people. That persistent human-vs-machine gap is the benchmark's whole reason for existing.
Going deeper
A few nuances separate a casual understanding of ARC-AGI from a real one.
"AGI" is a claim about the benchmark, not the models. The name signals the goal — measuring general, fluid intelligence — not that solving it produces artificial general intelligence. Chollet has been explicit that ARC-AGI is a necessary, not sufficient, signpost: a system could ace it and still be far from general intelligence, but a system that can't touch puzzles humans find easy clearly has a gap. Read the name as "a test pointed at AGI-style reasoning," not "the AGI exam."
Efficiency is part of the score. Because the ARC Prize runs under compute limits, there's a meaningful difference between a system that reasons its way to an answer and one that brute-forces thousands of guesses per task. A method that only wins by spending enormous compute is widely seen as not really capturing the point — fluid intelligence is supposed to be efficient. This is why reported results often pair an accuracy number with a cost-per-task figure, and why the same approach can look impressive or hollow depending on the budget it used.
It's a moving target by design. ARC-AGI is versioned. When models catch up to one set, a harder, human-calibrated successor follows — ARC-AGI-1, then ARC-AGI-2, with the explicit aim of keeping a wide gap between easy-for-humans and hard-for-machines. That makes it different from a static benchmark that simply saturates and gets retired. The frontier moves on purpose.
Where to go next. ARC-AGI is best understood alongside the benchmarks it reacts against. Compare it to knowledge tests like MMLU and agentic-coding tests like SWE-bench to see the knowledge-versus-reasoning split clearly, and read about benchmark contamination and how to read a benchmark score to interpret any of these numbers honestly. The durable lesson from ARC-AGI: a benchmark is only as meaningful as its resistance to shortcuts — and the hardest thing to fake is reasoning about something you've never seen.
FAQ
What does ARC-AGI stand for?
ARC stands for the Abstraction and Reasoning Corpus, and the AGI suffix signals its goal of measuring general, fluid intelligence. It's a benchmark of small grid puzzles where a system must infer a hidden transformation rule from a few examples and apply it to a new grid.
Who created ARC-AGI?
It was introduced by François Chollet in his 2019 paper On the Measure of Intelligence. The paper argued that intelligence should be measured by how efficiently a system handles novel problems, and ARC is the concrete puzzle set built to test that idea.
Why is ARC-AGI so hard for AI when it's easy for humans?
The puzzles are deliberately novel, so a model can't lean on memorized knowledge or training-data patterns — the usual strengths of large models. Each task needs fresh, on-the-spot reasoning about objects, shapes, and rules, and the answer grid must match exactly, with no partial credit.
What is the ARC Prize?
The ARC Prize is a public competition built around ARC-AGI, run on its hidden evaluation tasks and usually under compute limits, with prizes for systems that reach a high accuracy bar. The compute cap rewards efficient reasoning over brute-force guessing.
What is the difference between ARC-AGI-1 and ARC-AGI-2?
ARC-AGI-2 is a harder successor created after strong systems began making real progress on the original ARC-AGI-1. It is calibrated to stay easy for humans while remaining far from solved by machines, preserving the human-versus-AI gap the benchmark is meant to expose.
Does a high ARC-AGI score mean a model is AGI?
No. The name describes the benchmark's target — fluid, general reasoning — not a finish line. Chollet frames it as a necessary signpost, not a sufficient one: failing it reveals a reasoning gap, but passing it doesn't by itself prove artificial general intelligence.