What Is Inspect? AISI's AI Evaluation Framework

You will understand how Inspect structures rigorous frontier and safety evaluations using datasets, solvers, and scorers.

Intermediate · 10 min read · Updated 2026-06-14

Official site: inspect.aisi.org.uk · AISI: aisi.gov.uk · GitHub: UKGovernmentBEIS/inspect_ai

In plain English

Inspect is an open-source framework for testing AI models — built and maintained by the UK AI Security Institute (AISI). You use it to write a structured exam for a model, run that exam automatically, and get back graded results you can trust. The questions, the way the model is asked to answer, and the way each answer is graded are all defined as code, so the whole test is repeatable.

Think of a driving test. There is a fixed list of things to check (the dataset of tasks), an examiner who tells you what to do at each step and watches how you do it (the solver), and a scoring sheet that turns your performance into a pass or fail (the scorer). Inspect gives you exactly those three pieces for AI models. You supply the tasks and the rules; Inspect runs the model through them and tallies the score.

The part that makes Inspect special is what it was built to test. It is not aimed only at simple question-and-answer quizzes. It is designed for frontier and safety evaluations: checking whether a powerful model can do something dangerous, follow a multi-step plan, or use real tools. To do that safely, Inspect can run the model as an agent inside a locked-down sandbox, like giving a candidate a private, isolated room with its own computer instead of letting them touch the real network.

Why it matters

Most evaluation tools were built to answer one question: how smart is this model? They run academic benchmarks and report an accuracy number. Inspect was built to answer a harder, more specific question: what can this model actually do, and could any of it be dangerous? That shift in goal shapes the framework: tasks can run over many turns, tools and sandboxes are first-class, and grading is flexible.

A general benchmark runner is happy to ask a model a multiple-choice question and check the letter it picked. A safety evaluation often cannot work that way. Testing whether a model can carry out a cyber-attack, run a long research task, or be talked into harmful behavior means letting it act over many turns, use tools, and run code — and then judging the messy, open-ended result. Inspect is built from the ground up for that kind of test.

Dangerous-capability testing needs a sandbox. If you want to know whether a model can exploit a vulnerable server, you cannot point it at the real internet. Inspect runs agent tasks inside isolated containers, so the model can try things safely while you watch.
Safety results must be reproducible and auditable. When a government or lab reports that a model is or isn't dangerous, the test has to be re-runnable by others. Defining datasets, solvers, and scorers as code makes the whole evaluation a shareable artifact, not a one-off experiment.
Open tasks need flexible grading. A right answer to 'plan and execute this task' isn't a single string. Inspect lets you mix exact-match checks, pattern checks, and model-graded scoring in the same suite.
Provider independence. The same task can run against models from different providers behind one interface, so comparisons stay apples-to-apples.

If you build or audit AI systems and you care about behavior under realistic, multi-step, tool-using conditions — not just a quiz score — this is the kind of tool you reach for. It sits alongside, not against, your everyday LLM evals: the eval mindset is the same, but Inspect is tuned for the high-stakes, agentic end of the spectrum.

How it works

An Inspect evaluation is built from three core building blocks, plus an optional sandbox. Once you understand these four ideas, the whole framework makes sense.

Dataset — the collection of tasks. Each item is a sample with an input (the question or starting state) and usually a target (the correct answer or success condition).
Solver — the recipe for how the model tackles each sample. A solver can be as simple as 'send the prompt, read the reply' or as rich as a full agent loop that plans, calls tools, and reacts to results over many turns.
Scorer — the grader. It looks at what the model produced and the target, and assigns a score. Scorers range from exact match, to pattern matching, to a model-graded judge that reads the answer and rates it.
Sandbox — an isolated environment (typically a container) where an agent solver can safely run commands, edit files, or hit a mock service without touching the real world.

A task ties these together: a dataset plus a solver plus a scorer. You run a task, Inspect walks every sample through the solver, grades each result with the scorer, and aggregates everything into a report you can open in its viewer.

Figure: one sample flowing through an Inspect task. Dataset (load samples: input + target) → Solver (model answers or agent acts) → Sandbox (isolated tool and code execution) → Scorer (grade against target) → Results (aggregate and view).
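The same flow can be sketched in plain Python, with no Inspect installation needed. This is a toy model of the idea, not Inspect's API: `ToySample`, `lookup_solver`, `exact_scorer`, and `run_task` are hypothetical names, and the "model" is a hard-coded lookup table standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToySample:
    """One dataset item: what to ask and what counts as correct."""
    input: str
    target: str


# Hypothetical stand-in for a model; a real run would query an actual model.
CANNED_ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is the capital of Japan?": "Kyoto",  # deliberately wrong
}


def lookup_solver(sample: ToySample) -> str:
    """Solver: decides how an answer is produced for a sample."""
    return CANNED_ANSWERS.get(sample.input, "")


def exact_scorer(output: str, target: str) -> bool:
    """Scorer: grades the output against the target (case-insensitive)."""
    return output.strip().lower() == target.strip().lower()


def run_task(
    dataset: list[ToySample],
    solver: Callable[[ToySample], str],
    scorer: Callable[[str, str], bool],
) -> dict:
    """Walk every sample through the solver, score it, and aggregate."""
    results = [scorer(solver(sample), sample.target) for sample in dataset]
    return {"samples": len(results), "accuracy": sum(results) / len(results)}


dataset = [
    ToySample("What is the capital of France?", "Paris"),
    ToySample("What is the capital of Japan?", "Tokyo"),
]
report = run_task(dataset, lookup_solver, exact_scorer)
print(report)  # {'samples': 2, 'accuracy': 0.5}
```

Because the solver and scorer are passed in as arguments, either one can be swapped without touching the dataset, which is the separation the three building blocks are meant to give you.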

Here is the shape of a minimal task in Inspect's Python API. The decorators and helper names are illustrative of the style — the point is that a dataset, a solver, and a scorer come together in one Task.

A minimal Inspect task (Python):

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate
from inspect_ai.scorer import match

@task
def capital_cities():
    return Task(
        # 1) DATASET: each sample has an input and a target answer.
        dataset=[
            Sample(input="What is the capital of France?", target="Paris"),
            Sample(input="What is the capital of Japan?", target="Tokyo"),
        ],
        # 2) SOLVER: just ask the model to generate an answer.
        solver=generate(),
        # 3) SCORER: check the reply against the target.
        scorer=match(),
    )

You then run it from the command line against whatever model you point at, and Inspect produces a log you can open in its viewer to inspect every sample, the full transcript, and the score.

Running and viewing the eval (bash):

# Run the task against a chosen model.
inspect eval capital_cities.py --model anthropic/claude-sonnet-4-6

# Open the interactive viewer on the produced logs.
inspect view

Where Inspect goes beyond a quiz

The capital-cities example above is a warm-up. The reason a safety institute built its own framework is the harder case: evaluating a model as an agent that uses tools and acts over many turns. This is where the solver becomes a full loop and the sandbox earns its keep.

The agent loop as a solver

Instead of a single generate-and-grade step, an agent solver gives the model tools — run a shell command, read or write a file, call a search function — and lets it work toward a goal. The model proposes an action, the framework executes it, feeds the result back, and the loop repeats until the task is done or a limit is hit. This is the same agent pattern used across modern AI systems, applied here purely for testing.

Figure: an agent solver loop. Model picks a tool or action → Sandbox runs the action → Result fed back to model → Goal met? If not, repeat.
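A minimal sketch of that loop in plain Python may help make the control flow concrete. Everything here is an assumption for illustration: `scripted_model` is a hard-coded stand-in for a real model, `TOOLS` operates on an in-memory fake file system, and `agent_loop` is a hypothetical name rather than anything from Inspect's API.

```python
# In-memory fake file system, so no real files or commands are involved.
FAKE_FILES = {"notes.txt": "the flag is 42"}

# Hypothetical tools: each takes a string argument and returns a string result.
TOOLS = {
    "list_files": lambda arg: " ".join(sorted(FAKE_FILES)),
    "read_file": lambda arg: FAKE_FILES.get(arg, "no such file"),
}


def scripted_model(history: list[str]) -> dict:
    """Stand-in for a model: picks the next action from the last tool result."""
    if not history:
        return {"tool": "list_files", "arg": ""}
    if "notes.txt" in history[-1]:
        return {"tool": "read_file", "arg": "notes.txt"}
    return {"tool": "submit", "arg": history[-1]}


def agent_loop(model, tools, max_steps: int = 5) -> dict:
    """Propose an action, execute it, feed the result back, repeat."""
    history: list[str] = []
    for step in range(max_steps):
        action = model(history)
        if action["tool"] == "submit":
            return {"answer": action["arg"], "steps": step + 1, "stopped": "submitted"}
        result = tools[action["tool"]](action["arg"])
        history.append(result)  # the result becomes input to the next turn
    return {"answer": None, "steps": max_steps, "stopped": "step limit"}


outcome = agent_loop(scripted_model, TOOLS)
print(outcome)
```

The `max_steps` argument is the "limit is hit" exit described above: without it, a model that never submits would keep the loop running indefinitely.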

Why the sandbox is non-negotiable

If a solver can run shell commands, you must never let it run them on your real machine or the open internet — especially when the whole point is to see whether a model will attempt something harmful. Inspect runs these actions inside an isolated sandbox (commonly a Docker container), so a tested model gets a realistic computer to operate, but one that is fenced off and disposable. After the run, you tear the environment down.
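The "fenced off and disposable" lifecycle can be illustrated with a small standard-library sketch. It is emphatically not a security boundary and not how Inspect implements sandboxing: a real sandbox such as a container also isolates processes and the network. The names `disposable_workspace` and `write_file` are hypothetical, and the only things shown are confining a tool to one directory and deleting that directory afterwards.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def disposable_workspace():
    """Create a throwaway directory and delete it when the run ends."""
    with tempfile.TemporaryDirectory() as tmp:
        yield Path(tmp).resolve()


def write_file(root: Path, relative_path: str, text: str) -> Path:
    """Tool: write a file, refusing any path that escapes the workspace."""
    target = (root / relative_path).resolve()
    if not target.is_relative_to(root):
        raise PermissionError(f"{relative_path!r} is outside the workspace")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text)
    return target


with disposable_workspace() as root:
    created = write_file(root, "out/result.txt", "done")
    inside = created.exists()
    try:
        write_file(root, "../escape.txt", "nope")
        escaped = True
    except PermissionError:
        escaped = False

gone = not root.exists()  # the workspace is torn down after the run
print(inside, escaped, gone)  # True False True
```

The point of the sketch is the shape of the run (create, act, tear down), which is the same shape a container-based sandbox follows with much stronger isolation.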

Inspect vs a general benchmark runner

It is easy to lump Inspect in with tools whose job is to run academic benchmarks and print a leaderboard number. They overlap, but the design goals pull in different directions. The table below contrasts the two styles — not to crown a winner, but to show when each fits.

Aspect	General benchmark runner	Inspect (safety-focused)
Primary goal	Compare models on standard tasks	Probe real capabilities and safety risk
Typical task	Fixed Q&A, multiple choice	Multi-turn, tool-using, agentic tasks
Model interaction	Mostly single-shot prompts	Agent loops with tools, over many turns
Grading	Mostly exact / automated match	Mix of exact, pattern, and model-graded
Isolation	Usually none needed	Sandboxed execution is first-class
Best for	Leaderboards, regression on known sets	Frontier evals, dangerous-capability tests

In practice they are complementary. You might use a standard harness to track how a model does on well-known public benchmarks, and reach for Inspect when you need to script a bespoke, multi-step evaluation — say, can this model autonomously complete a realistic engineering task using a terminal? — and grade it with a blend of automated and model-graded checks.

Going deeper

Once the dataset–solver–scorer–sandbox model clicks, the rest of Inspect is about composing those pieces well and reading the results honestly. A few directions worth knowing as you go further.

Composable solvers. Solvers chain together. You can stack a system-prompt step, a few-shot step, a tool-use step, and a self-critique step into one pipeline, then reuse that pipeline across many datasets. This is what lets a single, well-designed agent harness be applied to a whole family of tasks instead of being rewritten each time.
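As a rough illustration of the composition idea, here is a plain-Python sketch in which each step is a function from state to state. It does not use Inspect's own solver API: `system_prompt`, `few_shot`, `ask`, `chain`, and `fake_model` are hypothetical names chosen for this example, and the state is just a dictionary.

```python
def system_prompt(text):
    """Step factory: add a system message to the conversation."""
    def step(state):
        state["messages"].append(("system", text))
        return state
    return step


def few_shot(examples):
    """Step factory: add worked question/answer pairs as prior turns."""
    def step(state):
        for question, answer in examples:
            state["messages"].append(("user", question))
            state["messages"].append(("assistant", answer))
        return state
    return step


def ask(model):
    """Step factory: send the actual question and record the model's reply."""
    def step(state):
        state["messages"].append(("user", state["question"]))
        state["output"] = model(state["messages"])
        state["messages"].append(("assistant", state["output"]))
        return state
    return step


def chain(*steps):
    """Compose steps into one pipeline that applies them in order."""
    def pipeline(state):
        for step in steps:
            state = step(state)
        return state
    return pipeline


def fake_model(messages):
    """Hypothetical stand-in for a model: upper-cases the last message."""
    return messages[-1][1].upper()


pipeline = chain(system_prompt("Answer briefly."), few_shot([("2+2?", "4")]), ask(fake_model))


def run(question):
    """Apply the shared pipeline to a fresh state for one question."""
    return pipeline({"messages": [], "question": question, "output": ""})


first = run("capital of France?")
second = run("capital of Japan?")  # same pipeline, different sample
print([role for role, _ in first["messages"]], first["output"])
```

Building the pipeline once and calling it per sample is what makes the harness reusable across datasets: only the question changes between runs.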

Model-graded scoring and its limits. For open-ended answers, a scorer can call another model to judge the output against a rubric — the LLM-as-a-judge idea. It is powerful but inherits the judge model's biases, so safety-critical evaluations often pair it with deterministic checks and human review rather than trusting a model judge alone.
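One way to pair a judge with a deterministic check is sketched below, under loud assumptions: `stub_judge` is a keyword-matching placeholder for a real judge model, the 0.5 threshold is arbitrary, and `combined_grade` is a hypothetical helper, not an Inspect scorer. The design choice shown is that disagreement between the two signals escalates to a human instead of being settled silently.

```python
import re


def pattern_match(output: str, pattern: str) -> bool:
    """Deterministic check: does the output contain the required pattern?"""
    return re.search(pattern, output) is not None


def stub_judge(output: str, rubric: list[str]) -> float:
    """Placeholder for a judge model: 1.0 if every rubric term is mentioned."""
    text = output.lower()
    return 1.0 if all(term in text for term in rubric) else 0.0


def combined_grade(output: str, pattern: str, rubric: list[str], judge) -> str:
    """Pass only when both signals agree; flag disagreements for review."""
    deterministic = pattern_match(output, pattern)
    judged = judge(output, rubric) >= 0.5  # arbitrary illustrative threshold
    if deterministic and judged:
        return "pass"
    if deterministic != judged:
        return "needs human review"
    return "fail"


PATTERN = r"\b8080\b"
RUBRIC = ["port", "config"]

good = combined_grade("The service listens on port 8080 because the config sets it.", PATTERN, RUBRIC, stub_judge)
split = combined_grade("Port is set in the config, probably 80.", PATTERN, RUBRIC, stub_judge)
bad = combined_grade("I don't know.", PATTERN, RUBRIC, stub_judge)
print(good, "|", split, "|", bad)  # pass | needs human review | fail
```

The second answer satisfies the lenient judge but fails the exact check, which is the kind of case a judge-only evaluation would have scored as a pass.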

The viewer and transcripts. Because agent runs are long and messy, reading the transcript matters as much as the final score. Inspect's viewer lets you replay each sample step by step — every tool call, every model message — which is how you catch a model that 'passed' for the wrong reason or a scorer that graded leniently.

Sample size and noise. Agentic evals are expensive and stochastic; the same task can pass on one run and fail on the next. Treat a single number with suspicion, run enough samples, and think about confidence — the same discipline you'd apply when sizing any eval set. A flashy capability claim from three runs is not evidence.
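To put a number on that suspicion, one common choice is a Wilson score interval on the pass rate. The sketch below uses a 95% interval (z = 1.96) and assumes runs are independent; both are assumptions of this example, and `wilson_interval` is a hypothetical helper written here, not part of Inspect.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives about 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


small = wilson_interval(3, 3)      # three runs, all passed
large = wilson_interval(90, 100)   # one hundred runs, ninety passed
print(f"3/3 runs:    {small[0]:.2f} to {small[1]:.2f}")
print(f"90/100 runs: {large[0]:.2f} to {large[1]:.2f}")
```

Three passes out of three leaves a lower bound of roughly 0.44, so the data are compatible with a model that fails more than half the time, while ninety out of a hundred narrows the range to within a few points of the observed rate.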

Where to go next. Inspect fits inside the broader practice of building and running an evaluation suite. Start by writing one tiny task end to end, get it running in the viewer, then grow it: add a tool, add a sandbox, swap in a model-graded scorer. The honest takeaway is that the framework is the easy part — the hard part is designing tasks that actually measure the capability or risk you care about, and resisting the urge to read more into the score than the test can support.

FAQ

Who makes Inspect and is it free to use?

Inspect is built and maintained by the UK AI Security Institute (AISI), the UK government body that studies risks from advanced AI. It is open-source and free to use, which is why labs, researchers, and other organizations can run the same rigorous safety evaluations.

What is the difference between a solver and a scorer in Inspect?

A solver defines how the model tackles a task — from a single prompt to a full agent loop with tools. A scorer defines how the result is graded against the target, using exact match, pattern matching, or a model-graded judge. The solver produces the answer; the scorer rates it.

Why does Inspect run agents in a sandbox?

Because safety evaluations often need the model to run real commands or use tools, and you must never let it touch your real machine or the open internet — especially when testing for dangerous behavior. The sandbox is an isolated, disposable environment (often a Docker container) where the model can act safely while you watch.

How is Inspect different from a benchmark runner like lm-evaluation-harness?

A benchmark runner is optimized for comparing models on fixed, mostly single-shot academic tasks and reporting an accuracy number. Inspect is optimized for multi-turn, tool-using, agentic evaluations with sandboxed execution and flexible grading. They are complementary: one for leaderboards, the other for bespoke capability and safety testing.

Can I use Inspect with models from different providers?

Yes. Inspect exposes a single interface that can run the same task against models from different providers, so you can compare them on an equal footing. You point the run at a chosen model and the rest of the task stays the same.

Do I need to test dangerous capabilities to use Inspect?

No. Inspect works perfectly well for ordinary evaluations like question answering or tool use; the capital-cities style task is a few lines of code. The sandboxing and agent features are there when you need them, but you can start with simple datasets and scorers and grow from there.
