Overview
Terminal-Bench is a benchmark for testing how well AI agents handle real terminal tasks, from compiling code to setting up servers and training models. It has two parts: a dataset of tasks and an execution harness that connects a language model to a sandboxed terminal environment.
It is built for people working on LLM agents, benchmarking frameworks, or system-level reasoning. Each task ships with an English instruction, a test script that checks whether the agent succeeded, and a reference "oracle" solution. You run everything through the tb command-line tool.
As an evaluation and benchmark harness, it gives you a reproducible task suite and a runner so you can score agents the same way each time. It is currently in beta with around 100 tasks, and there is a public leaderboard you can submit to.
What it does
- Two-part design: a dataset of terminal tasks plus an execution harness that runs them
- Each task includes an English instruction, a verification test script, and a reference oracle solution
- Runs agents against a sandboxed terminal environment using Docker
- Single CLI (tb) to run evaluations, with flags for agent, model, dataset name, and version
- Versioned datasets (e.g. terminal-bench-core v0.1.1) tied to a public leaderboard
- Open to contributions of new tasks and benchmark adapters
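To make the three task pieces concrete, here is a toy sketch of how an instruction, a reference solution, and a verification step relate to each other. It is not a real Terminal-Bench task: the instruction text, the function names `oracle_solution` and `verify_task`, and the file `greeting.txt` are all made up for illustration, and real tasks in the repo may be laid out quite differently.

```shell
#!/bin/sh
# Toy task for illustration only; every name here is hypothetical.
INSTRUCTION='Create a file named greeting.txt containing the single line "hello".'

# Reference ("oracle") solution: one known-good way to satisfy the instruction.
oracle_solution() {
    printf 'hello\n' > "$1/greeting.txt"
}

# Verification step: checks the resulting state of the working directory.
verify_task() {
    [ -f "$1/greeting.txt" ] && [ "$(cat "$1/greeting.txt")" = "hello" ]
}

workdir=$(mktemp -d)
echo "Instruction: $INSTRUCTION"

# Before anything runs, the check should fail.
if verify_task "$workdir"; then echo "before: pass"; else echo "before: fail"; fi

# After the oracle solution runs, the same check should pass.
oracle_solution "$workdir"
if verify_task "$workdir"; then echo "after: pass"; else echo "after: fail"; fi

rm -rf "$workdir"
```

In this sketch the check looks only at the end state, so any sequence of commands that produces the right file would pass, not just the oracle's.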
Getting started
Terminal-Bench is distributed as a Python package, installable with uv or pip, and is driven by the tb CLI. You also need uv and Docker installed to run the harness.
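Before installing, it can help to confirm that both prerequisites are available. The following is a minimal sketch using the POSIX `command -v` builtin; the helper name `missing_tools` is made up for this example and is not part of Terminal-Bench.

```shell
#!/bin/sh
# Print the name of each required tool that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of the tb CLI.)
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the tool cannot be found
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

missing=$(missing_tools uv docker)
if [ -n "$missing" ]; then
    echo "Missing prerequisites:"
    echo "$missing"
else
    echo "uv and docker are both on PATH"
fi
```

This only checks that the binaries exist. It does not confirm that the Docker daemon is running; `docker info` is one way to check that separately.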
Install the package
Install Terminal-Bench with uv (recommended) or pip.
uv tool install terminal-bench
Install with pip (alternative)
If you prefer pip, install the same package directly.
pip install terminal-bench
See the harness options
The harness connects a model to a sandboxed terminal. View the available run options with the help flag.
tb run --help
Run against the leaderboard dataset
Evaluate an agent and model on Terminal-Bench-Core. Pass the dataset name and version to match the current leaderboard.
tb run \
--agent terminus \
--model anthropic/claude-3-7-latest \
--dataset-name terminal-bench-core \
--dataset-version 0.1.1 \
--n-concurrent 8
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Benchmark how well an LLM agent completes real, end-to-end command-line tasks
- Compare different agents or models on the same reproducible task suite
- Stress-test an agent's system-level reasoning in a sandboxed shell before shipping
- Submit results to the Terminal-Bench leaderboard or contribute new tasks and adapters
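For the comparison use case, one approach is to hold the agent, dataset name, and dataset version fixed and vary only the model. The sketch below prints the `tb run` command for each model as a dry run rather than executing it. It uses only the flags shown in the Getting started example. The helper name `build_tb_command` and the second model identifier are placeholders, not real values from the project.

```shell
#!/bin/sh
# Build the `tb run` command line for one model.
# Only flags from the Getting started example are used; the helper name is hypothetical.
build_tb_command() {
    model=$1
    printf 'tb run --agent terminus --model %s --dataset-name terminal-bench-core --dataset-version 0.1.1 --n-concurrent 8\n' "$model"
}

# Only the first entry comes from the docs; the second is a placeholder
# to replace with a model identifier you actually have access to.
MODELS="anthropic/claude-3-7-latest example-provider/example-model"

for model in $MODELS; do
    # Dry run: print each command for review instead of executing it.
    build_tb_command "$model"
done
```

Keeping the dataset name and version pinned across runs is what makes the resulting scores comparable with each other and with the leaderboard.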
How Terminal-Bench compares
Terminal-Bench alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| Terminal-Bench | ★ 2.4k | A benchmark and harness for testing AI agents on real terminal tasks. |