AI/TLDR

Terminal-Bench

A benchmark and harness for testing AI agents on real terminal tasks

Overview

Terminal-Bench is a benchmark for testing how well AI agents handle real terminal tasks, from compiling code to setting up servers and training models. It has two parts: a dataset of tasks and an execution harness that connects a language model to a sandboxed terminal environment.

It is built for people working on LLM agents, benchmarking frameworks, or system-level reasoning. Each task ships with an English instruction, a test script that checks whether the agent succeeded, and a reference "oracle" solution. You run everything through the tb command-line tool.
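
To make the instruction / test script / oracle split concrete, here is a toy sketch in plain shell. It is not the Terminal-Bench task format: the instruction text, the file name greeting.txt, and the functions oracle_solution and verify_task are all hypothetical names chosen for illustration. The point is only that the test checks the end state of the workspace, and the oracle is one known way to reach that state.

```bash
# Toy task for illustration only; this is NOT the Terminal-Bench task format.
INSTRUCTION='Create a file named greeting.txt containing the single line: hello'

# Hypothetical oracle: a known-good solution to the instruction.
# Runs in a subshell so the caller's working directory is unchanged.
oracle_solution() (
    cd "$1" && printf 'hello\n' > greeting.txt
)

# Hypothetical test: exits 0 only if the workspace is in the expected end state.
verify_task() (
    cd "$1" && [ -f greeting.txt ] && [ "$(cat greeting.txt)" = "hello" ]
)

workdir=$(mktemp -d)
echo "Instruction: $INSTRUCTION"
verify_task "$workdir" && echo "before oracle: pass" || echo "before oracle: fail"
oracle_solution "$workdir"
verify_task "$workdir" && echo "after oracle: pass" || echo "after oracle: fail"
rm -rf "$workdir"
```

An agent under evaluation would take the place of oracle_solution: it receives the instruction, acts in the terminal, and is scored by the same test.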

Because the task suite and the runner are fixed, you can score agents the same way each time. The project is currently in beta with around 100 tasks, and there is a public leaderboard you can submit results to.

What it does

  • Two-part design: a dataset of terminal tasks plus an execution harness that runs them
  • Each task includes an English instruction, a verification test script, and a reference oracle solution
  • Runs agents against a sandboxed terminal environment using Docker
  • Single CLI (tb) to run evaluations, with flags for agent, model, dataset name, and version
  • Versioned datasets (e.g. terminal-bench-core v0.1.1) tied to a public leaderboard
  • Open to contributions of new tasks and benchmark adapters

Getting started

Terminal-Bench ships as a pip package and is driven by the tb CLI. You also need uv and Docker installed to run the harness.
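
Before installing, it can help to confirm that the prerequisites are on your PATH. The following is a minimal sketch, not part of Terminal-Bench: missing_tools is a hypothetical helper name, and the check only looks for the executables, so it cannot tell you whether the Docker daemon is actually running.

```bash
# Print the name of every listed tool that is not on PATH, one per line.
# Prints nothing when all tools are found.
missing_tools() {
    for tool in "$@"; do
        # `command -v` exits non-zero when the tool cannot be found.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Warn (without exiting) if uv or Docker is missing.
missing=$(missing_tools uv docker)
if [ -n "$missing" ]; then
    echo "Install these before running tb:" $missing >&2
fi
```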

Install the package

Install Terminal-Bench with uv (recommended) or pip.

uv tool install terminal-bench

Install with pip (alternative)

If you prefer pip, install the same package directly.

pip install terminal-bench

See the harness options

The harness connects a model to a sandboxed terminal. View the available run options with the help flag.

tb run --help

Run against the leaderboard dataset

Evaluate an agent and model on Terminal-Bench-Core. Pass the dataset name and version to match the current leaderboard.

tb run \
    --agent terminus \
    --model anthropic/claude-3-7-latest \
    --dataset-name terminal-bench-core \
    --dataset-version 0.1.1 \
    --n-concurrent 8

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Benchmark how well an LLM agent completes real, end-to-end command-line tasks
  • Compare different agents or models on the same reproducible task suite
  • Stress-test an agent's system-level reasoning in a sandboxed shell before shipping
  • Submit results to the Terminal-Bench leaderboard or contribute new tasks and adapters
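
For the comparison use case, one option is to hold the agent, dataset name, and dataset version fixed and vary only the model. The sketch below wraps the tb run command from the Getting started section using only the flags shown there. The function names run_one and compare_models and the DRY_RUN variable are assumptions made for this example, not part of the tb CLI. By default it only prints the commands it would run.

```bash
# Build one tb run command for the given model; all other flags are fixed
# to the values used in the Getting started example.
run_one() {
    set -- tb run --agent terminus --model "$1" \
        --dataset-name terminal-bench-core --dataset-version 0.1.1 --n-concurrent 8
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # dry run (default): print the command only
    else
        "$@"           # DRY_RUN=0: actually invoke tb
    fi
}

# Run (or print) one evaluation per model, stopping at the first failure.
compare_models() {
    for model in "$@"; do
        run_one "$model" || return 1
    done
}

# Prints the command for the model used in the Getting started example.
compare_models anthropic/claude-3-7-latest
```

Pass additional model identifiers as extra arguments to compare_models, and set DRY_RUN=0 once the printed commands look right.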

How Terminal-Bench compares

Terminal-Bench alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
Terminal-Bench | ★ 2.4k | A benchmark and harness for testing AI agents on real terminal tasks.