What Is Mechanistic Interpretability? Looking Inside the Black Box

Understand what mechanistic interpretability is trying to do, what features and circuits are, and why it matters for safety.

ADVANCED · 11 MIN READ · UPDATED 2026-06-11

In plain English

A modern language model is a few hundred billion numbers — weights — that turn your prompt into the next word. Nobody wrote those numbers by hand. They were grown by training, and the result is a system that clearly works but that nobody can read like source code. Mechanistic interpretability (often shortened to "mech interp") is the field that tries to reverse-engineer those numbers back into something a human can understand: not "the model said X," but which specific internal machinery produced X, and why.

Think of it like a biologist with a microscope, not a doctor reading symptoms. A doctor watches behavior from the outside: the patient has a fever, so something is wrong. A biologist opens the cell, traces which protein binds to which, and explains the fever mechanically. Most AI testing is the doctor: we feed prompts in, watch outputs, and grade them. Mechanistic interpretability is the biologist: it opens the network mid-computation and asks what each part is actually doing.

The whole field rests on a hopeful bet: that the messy soup of weights actually contains discrete, reusable pieces of meaning — and that we can find them. The two pieces it hunts for are features (a direction inside the network that stands for a concept, like "this text is in French" or "the Golden Gate Bridge") and circuits (small wiring diagrams of features that work together to perform a task, like "copy the name mentioned earlier in the sentence").

Why it matters

Almost everything we know about whether a model is safe comes from watching its behavior. We write evals, we red-team it, we check that it refuses the bad requests. But behavioral testing has a hard ceiling: it can only catch problems you thought to test for. If a model is deceptive — behaving well on every test precisely because it's being tested — no amount of more tests will reveal it. That's the nightmare that motivates the field.

Mechanistic interpretability is the bet that we can sidestep that ceiling by reading the model's internals instead of its outputs. If you can find the circuit that represents "I am currently being evaluated" or the feature for "deception," you can check for trouble that no behavioral test would ever surface. It's the difference between a polygraph that watches your face and an MRI that watches your brain.

Safety and alignment researchers want it because behavioral evals can't detect a model that's hiding its intentions. Internals can't lie the way outputs can.
Builders care because it promises real debugging — finding which feature makes your model sycophantic or biased, instead of just patching prompts and hoping.
Regulators and frontier labs lean on it for frontier safety policies: a credible internal audit is stronger evidence than a passing test suite.

What did it replace? Mostly, hand-waving. For years the standard answer to "how does the model do this?" was a shrug and a citation to "it's a black box." Mech interp is the refusal to accept that shrug — an attempt to upgrade AI from alchemy toward something closer to engineering, where you can point at a component and say what it does.

How it works

Start with the obvious idea that turned out to be wrong. You'd hope each neuron inside the network stands for one clean concept: neuron #4,071 fires for "dog," another for "sarcasm." Reality is messier: most neurons are polysemantic, firing for a jumble of unrelated things. One neuron might activate for academic citations, for the color green, and for HTTP requests. That's not a bug. It's how the network crams far more concepts into a layer than the layer has neurons.

Superposition: the central obstacle

The reason neurons are polysemantic is superposition: the network packs more distinct features than it has neurons by storing them as overlapping directions in the high-dimensional activation space, rather than one-per-neuron. Because real features are sparse — only a handful are active on any given input — the model can overlay them with tolerable interference, like a radio fitting many stations into one band. Superposition is great for the model and terrible for us: it's exactly why you can't just read concepts off individual neurons. Anthropic's Toy Models of Superposition paper laid this out cleanly on tiny models you can fully understand.
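The overlap-with-tolerable-interference idea can be tried in a few lines. The sketch below is a toy under stated assumptions, not how any real model stores features: the feature directions are random unit vectors (real models learn theirs), and the sizes (1024 features in 256 dimensions, 3 active at once) are arbitrary illustrative choices.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction of length 1 in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def superposition_demo(n_features=1024, dim=256, n_active=3, seed=0):
    """Pack n_features directions into dim < n_features dimensions,
    switch on a few of them, and read every feature back by dot product."""
    rng = random.Random(seed)
    directions = [random_unit_vector(dim, rng) for _ in range(n_features)]

    # Sparse input: only a handful of features are active.
    active = set(rng.sample(range(n_features), n_active))

    # The activation vector is the sum of the active feature directions.
    activation = [sum(directions[i][k] for i in active) for k in range(dim)]

    # Read out every feature. Active ones score near 1; inactive ones pick
    # up small nonzero "interference" because the directions overlap.
    readout = [dot(activation, d) for d in directions]
    top = sorted(range(n_features), key=lambda i: readout[i], reverse=True)
    recovered = set(top[:n_active])
    interference = max(abs(readout[i]) for i in range(n_features) if i not in active)
    return active, recovered, interference


active, recovered, interference = superposition_demo()
print("active features:   ", sorted(active))
print("recovered features:", sorted(recovered))
print("worst interference on an inactive feature:", round(interference, 3))
```

With four times more features than dimensions, no set of directions can be fully orthogonal, so the inactive readouts are never exactly zero. Raising `n_active` makes the interference grow, which is the sense in which sparsity is what keeps superposition workable.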

Sparse autoencoders: prying features apart

The breakthrough tool for undoing superposition is the sparse autoencoder (SAE). You take the model's activations at some layer and train a second, much wider network that learns to re-express those activations as a sum of many features, under one constraint: only a few features may be active at once. That sparsity constraint (typically a penalty on feature activations added to the training loss) pushes the SAE to discover a dictionary of largely monosemantic features, directions that each mean one thing. Suddenly you get human-readable features like "the Golden Gate Bridge," "text written in legal language," or "a function is about to be called in code."

// From tangled activations to readable features

Run the model (on lots of text) → Grab activations (one layer's vectors) → Train an SAE (wide + sparse) → Read the dictionary (monosemantic features) → Label each feature (what input fires it?)
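To make "wide + sparse" concrete, here is a minimal sketch of the forward pass and loss for one common SAE variant: a ReLU encoder with an L1 penalty on feature activations. It is an illustration under assumptions, not the recipe of any particular paper or library: the class name `TinySAE`, the sizes, and the `l1_coeff` value are all hypothetical, the weights are untrained, and the training loop is omitted. Real work would use PyTorch and a toolkit such as SAELens rather than plain lists.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Untrained toy sparse autoencoder: d_model inputs, d_dict >> d_model features."""

    def __init__(self, d_model, d_dict, seed=0):
        rng = random.Random(seed)
        # Encoder maps activations to feature space; decoder maps back.
        self.W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        self.W_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # ReLU keeps feature activations non-negative, so "off" means exactly 0.
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        # The reconstruction is a weighted sum of decoder columns (the dictionary).
        return [r + b for r, b in zip(matvec(self.W_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        features = self.encode(x)
        x_hat = self.decode(features)
        reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(features)  # L1 norm, since features are >= 0
        return reconstruction + l1_coeff * sparsity, features, x_hat


sae = TinySAE(d_model=4, d_dict=16)
total, features, x_hat = sae.loss([1.0, -0.5, 0.25, 2.0])
print("loss:", round(total, 4), "| active features:", sum(f > 0 for f in features))
```

Training minimizes this loss over many activation vectors. The two terms pull against each other: the reconstruction term wants every feature available, while the L1 term wants as few switched on as possible, and the dictionary that survives that tension is what gets read and labeled.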

Features are the nouns of the field; circuits are the verbs. A circuit is a small subgraph of features and attention heads that together implement a behavior. The famous early example is induction heads: a two-head circuit that learns the rule "if the pattern A B appeared earlier, and you just saw A again, predict B." That tiny mechanism is a big chunk of why models can copy names, complete repeated phrases, and do in-context learning at all.
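The rule an induction circuit implements is simple enough to write as ordinary code. The sketch below captures only the input-output behavior ("find the last time the current token appeared, predict what followed it"), not the attention mechanism that computes it inside a transformer; the function name is a hypothetical one chosen for illustration.

```python
def induction_predict(tokens):
    """Predict the next token with the rule: if 'A B' appeared earlier
    and the sequence now ends in 'A', predict 'B'.

    Returns None when the current token has not been seen before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards so the most recent earlier occurrence wins.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


prompt = ["Ada", "Lovelace", "wrote", "the", "notes", ".", "Ada"]
print(induction_predict(prompt))  # prints: Lovelace
```

Inside a real model the same lookup is split across two attention heads in different layers: roughly, one marks each position with the token that preceded it, and the other uses that mark to attend to the position just after the earlier "A".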

Proving a part actually causes the behavior

Finding a feature isn't enough — correlation isn't causation. To prove a component causes a behavior, researchers run interventions: clamp a feature on or off, or swap activations from one prompt into another ("activation patching"), and watch whether the output changes as predicted. When Anthropic forced the "Golden Gate Bridge" feature to fire, the model started steering every conversation toward the bridge — strong causal evidence that the feature really was that concept, not a coincidence.
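Activation patching is easiest to see on a network small enough to check by hand. The sketch below is a toy, assuming a hand-wired three-unit hidden layer (`W_IN` and `W_OUT` are made-up weights, not taken from any real model): it runs a "clean" and a "corrupted" input, then copies one hidden unit at a time from the clean run into the corrupted run and measures how much of the clean output comes back.

```python
# Hypothetical hand-set weights: 2 inputs -> 3 hidden units -> 1 output.
W_IN = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_OUT = [2.0, 0.0, 0.1]  # the output leans almost entirely on hidden unit 0


def forward(x, patch=None):
    """Run the toy network. `patch` maps hidden-unit index -> value to force."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_IN]
    if patch is not None:
        for unit, value in patch.items():
            hidden[unit] = value  # overwrite the activation mid-computation
    output = sum(w * h for w, h in zip(W_OUT, hidden))
    return output, hidden


def patching_effects(clean, corrupted):
    """For each hidden unit, the fraction of the clean-vs-corrupted output
    gap restored by patching that unit's clean value into the corrupted run.
    Assumes the two inputs actually produce different outputs."""
    clean_out, clean_hidden = forward(clean)
    corrupt_out, _ = forward(corrupted)
    gap = clean_out - corrupt_out
    effects = []
    for unit, value in enumerate(clean_hidden):
        patched_out, _ = forward(corrupted, patch={unit: value})
        effects.append((patched_out - corrupt_out) / gap)
    return effects


effects = patching_effects(clean=[1.0, 0.0], corrupted=[0.0, 1.0])
for unit, effect in enumerate(effects):
    print(f"hidden unit {unit}: restores {effect:.0%} of the clean output")
```

Here unit 0 restores the whole gap and the other two restore none of it. Unit 2 is the instructive case: it is active on both inputs, so it correlates with the computation, yet patching it changes nothing. That is the distinction the intervention is designed to expose.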

The core vocabulary

A handful of terms unlock most papers in the field. Here's the cheat sheet:

Term	Plain meaning
Feature	A direction in activation space that stands for one concept.
Circuit	A small set of features/heads wired together to do one task.
Polysemantic neuron	A single neuron that fires for several unrelated concepts.
Superposition	Storing more features than neurons by overlapping directions.
Monosemantic	A unit (usually an SAE feature) that means exactly one thing.
Activation patching	Swapping internal values between prompts to test causality.
Probe	A small classifier trained to read a concept off activations.

A tiny probe in code

You don't need a frontier lab to taste this. The most accessible experiment is a linear probe on a small open model. Pull hidden states out of the model, then fit a simple classifier and see whether a concept is linearly readable. The library most researchers reach for is TransformerLens, which exposes every internal activation by name.

linear_probe.py (Python)

import torch
from transformer_lens import HookedTransformer
from sklearn.linear_model import LogisticRegression

model = HookedTransformer.from_pretrained("gpt2-small")

# Two classes of prompts whose concept we want to find a 'direction' for.
pos = ["Paris is the capital of France.", "Tokyo is in Japan."]
neg = ["The cat slept on the warm windowsill.", "He bought three apples."]
LAYER = 6  # which residual-stream layer to read

def rep(text: str):
    # run_with_cache exposes every internal activation by name
    _, cache = model.run_with_cache(text)
    # take the last token's residual-stream vector at LAYER
    # (.cpu() keeps this working when the model is loaded on a GPU)
    return cache["resid_post", LAYER][0, -1].detach().cpu().numpy()

X = [rep(t) for t in pos + neg]
y = [1] * len(pos) + [0] * len(neg)  # 1 = 'is a geography fact'

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
# probe.coef_ is now a *direction* in activation space for the concept.

With a real dataset (hundreds of examples per class, held-out test set), the probe's accuracy tells you whether the concept lives in the model as a clean direction. That probe.coef_ vector is a primitive feature: the same kind of object an SAE finds automatically (a direction in activation space), just discovered by hand, with labels, for one concept you chose. This is also a useful evaluation trick: probing the internals is a different lens from the output-grading metrics most teams rely on.
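The held-out evaluation described above can be sketched without loading a model at all. Everything here is a stand-in and should be read as an assumption: the vectors are synthetic (random noise plus a planted concept direction) rather than real residual-stream activations, and the probe is a difference-of-means classifier, a simpler technique than the logistic regression used in linear_probe.py. The point is only the train/test discipline.

```python
import random


def make_example(label, concept_direction, rng, noise=1.0):
    """Synthetic 'activation': +direction for class 1, -direction for class 0, plus noise."""
    sign = 1.0 if label == 1 else -1.0
    return [sign * d + rng.gauss(0.0, noise) for d in concept_direction]


def mean_vector(rows):
    return [sum(column) / len(rows) for column in zip(*rows)]


def fit_mean_difference_probe(X, y):
    """Probe direction = mean(class 1) - mean(class 0); threshold at the midpoint."""
    pos_mean = mean_vector([x for x, label in zip(X, y) if label == 1])
    neg_mean = mean_vector([x for x, label in zip(X, y) if label == 0])
    direction = [p - n for p, n in zip(pos_mean, neg_mean)]
    midpoint = [(p + n) / 2 for p, n in zip(pos_mean, neg_mean)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def accuracy(X, y, direction, bias):
    hits = 0
    for x, label in zip(X, y):
        score = sum(d * xi for d, xi in zip(direction, x)) + bias
        hits += int((1 if score > 0 else 0) == label)
    return hits / len(y)


rng = random.Random(0)
DIM = 16
planted = [1.0 if k < 4 else 0.0 for k in range(DIM)]  # concept lives in 4 of 16 dims
labels = [i % 2 for i in range(400)]
data = [make_example(label, planted, rng) for label in labels]

# Fit on the first 300 examples, score on the 100 the probe never saw.
X_train, y_train = data[:300], labels[:300]
X_test, y_test = data[300:], labels[300:]
direction, bias = fit_mean_difference_probe(X_train, y_train)
train_acc = accuracy(X_train, y_train, direction, bias)
test_acc = accuracy(X_test, y_test, direction, bias)
print("train accuracy:", train_acc, "| held-out accuracy:", test_acc)
```

The held-out number is the one to report. With only four training prompts, as in the snippet above it, a probe can reach perfect training accuracy by memorizing quirks of those prompts, which says nothing about whether the concept is linearly readable.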

Interpretability vs behavioral testing

Mechanistic interpretability doesn't replace evals, benchmarks, or red-teaming — it complements them. Each answers a different question:

// Two ways to inspect a model

Behavioral testing

Watches inputs and outputs
Cheap, scalable, everywhere
Catches what you test for
Blind to hidden intent
Evals, benchmarks, red-teaming

Mechanistic interpretability

Watches internal activations
Slow, research-heavy, partial
Can surface untested failures
Aims to detect hidden intent
Features, circuits, SAEs

In practice they form a loop. A behavioral test flags that the model is oddly biased or keeps falling for a jailbreak; interpretability goes looking for the feature or circuit behind it; an intervention confirms the cause; and a new, sharper eval is written to guard against it. Outputs tell you that something is wrong; internals start to tell you why.

Going deeper

Scaling SAEs to real models. Toy models were the proof of concept; the hard part is dictionary-learning at frontier scale. Anthropic's Scaling Monosemanticity work trained SAEs on a production model and pulled out millions of interpretable features — abstract ones for concepts like "code with security vulnerabilities" and "sycophantic praise," not just surface tokens. The open challenges are brutal: SAEs are expensive to train, the right number of features is a guess, and feature splitting means a single coarse feature in a small dictionary fractures into many fine-grained ones as you widen it — so there's no canonical "true" set of features, only a resolution you chose.

Attribution graphs and the circuits frontier. Newer work goes beyond cataloguing features to tracing how they connect across layers into end-to-end computation graphs for a single prompt — sometimes called attribution or replacement graphs. This is where interpretability starts to resemble reading a program: you can watch a feature for "the capital of a state" activate, feed a downstream feature, and produce the answer. It's also where the field is most fragile — these graphs are approximations of an approximation, and validating them is an active research problem.

The dark-matter problem. Current SAEs don't explain everything. There's a stubborn residual — activations that no learned feature accounts for, informally called "dark matter." Maybe it's noise; maybe it's structure too entangled or non-linear for today's linear dictionaries. If a meaningful fraction of the model's computation lives in that residual, then interpretability-based safety guarantees have a hole in them, and honest researchers say so.
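One common way to put a number on that residual is the fraction of variance unexplained: the squared reconstruction error divided by the total variance of the activations. The sketch below assumes you already have a batch of activations and the SAE's reconstructions of them as plain lists; the function name is hypothetical, and this single scalar says nothing about whether the leftover is noise or structure.

```python
def fraction_of_variance_unexplained(activations, reconstructions):
    """Residual sum of squares divided by total sum of squares around the mean.

    0.0 means the reconstructions are perfect; 1.0 means they do no better
    than always predicting the mean activation.
    """
    n = len(activations)
    dim = len(activations[0])
    mean = [sum(row[k] for row in activations) / n for k in range(dim)]
    residual = sum(
        (a - r) ** 2
        for row, rec in zip(activations, reconstructions)
        for a, r in zip(row, rec)
    )
    total = sum((a - m) ** 2 for row in activations for a, m in zip(row, mean))
    return residual / total


acts = [[1.0, 2.0], [3.0, 4.0]]
print(fraction_of_variance_unexplained(acts, acts))                      # prints: 0.0
print(fraction_of_variance_unexplained(acts, [[2.0, 3.0], [2.0, 3.0]]))  # prints: 1.0
```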

Why this is a safety bet, not a solved tool. The strongest version of the case is enumerative safety: if you could enumerate every feature and circuit, you could audit a model for dangerous capabilities directly, the way you'd audit code — a far stronger guarantee than security testing or any behavioral suite. We are nowhere near that. Today's wins are real but local: specific features found, specific circuits explained, specific behaviors steered. The open question that defines the field is whether interpretability scales faster than the models do, or whether the next generation outruns our microscopes. For now, treat it as the most promising path to reading a model's mind — and one that is still mostly a research program, not a product.

FAQ

What is the difference between a feature and a circuit in interpretability?

A feature is a single direction in the network's activation space that stands for one concept — "text in French," "the Golden Gate Bridge." A circuit is a small group of features and attention heads wired together to carry out a task, like copying a name from earlier in the sentence. Features are the building blocks; circuits are the mechanisms built from them.

What is a sparse autoencoder and why is it used for interpretability?

A sparse autoencoder (SAE) is a second network trained to re-express a model's activations as a sum of many features where only a few are active at once. That sparsity forces it to discover monosemantic features — directions that each mean one thing — which untangles the superposition that makes raw neurons unreadable. It's the main tool for pulling human-interpretable concepts out of a model.

Why are neurons in a neural network polysemantic?

Because of superposition. Models need to represent far more concepts than they have neurons, so they store features as overlapping directions across many neurons rather than one concept per neuron. Since only a few features are active on any given input, the model tolerates the overlap — but it means a single neuron ends up firing for several unrelated concepts.

Is mechanistic interpretability the same as explainable AI (XAI)?

Not quite. Traditional explainable AI often produces input-attribution heatmaps ("these words mattered most") or relies on the model's own explanation. Mechanistic interpretability is more ambitious: it reverse-engineers the actual internal algorithm — the features and circuits — rather than describing which inputs were influential or trusting the model's self-report.

Can mechanistic interpretability detect if an AI is deceptive?

That is the long-term hope, not a current guarantee. The motivation is that a deceptive model could pass every behavioral test by design, but its internal activations might still reveal the deception. Researchers have found and steered concept features, but reliably detecting deception in a frontier model is an open problem — current methods explain specific behaviors, not the whole network.

What tools do people use to do mechanistic interpretability?

Common open-source libraries include TransformerLens for inspecting transformer internals by name, and SAE-training toolkits like SAELens. Standard techniques are linear probes, activation patching, and training sparse autoencoders. Most published frontier work comes from labs like Anthropic and from the broader interpretability research community.
