Terminal-Bench

A benchmark and harness for testing AI agents on real terminal tasks

github.com/harbor-framework/terminal-bench (★ 2.4k) | tbench.ai

Overview

Terminal-Bench is a benchmark for testing how well AI agents handle real terminal tasks, from compiling code to setting up servers and training models. It has two parts: a dataset of tasks and an execution harness that connects a language model to a sandboxed terminal environment.

It is built for people working on LLM agents, benchmarking frameworks, or system-level reasoning. Each task ships with an English instruction, a test script that checks whether the agent succeeded, and a reference "oracle" solution. You run everything through the tb command-line tool.

As an evaluation and benchmark harness, it gives you a reproducible task suite and a runner so you can score agents the same way each time. It is currently in beta with around 100 tasks, and there is a public leaderboard you can submit to.

What it does

Two-part design: a dataset of terminal tasks plus an execution harness that runs them
Each task includes an English instruction, a verification test script, and a reference oracle solution
Runs agents against a sandboxed terminal environment using Docker
Single CLI (tb) to run evaluations, with flags for agent, model, dataset name, and version
Versioned datasets (e.g. terminal-bench-core v0.1.1) tied to a public leaderboard
Open to contributions of new tasks and benchmark adapters

Getting started

Terminal-Bench ships as a pip package and is driven by the tb CLI. You also need uv and Docker installed to run the harness.
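
Before installing, it can help to confirm that the two prerequisites named above are on your PATH. The following is a minimal sketch, not part of Terminal-Bench itself; missing_tools is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper (not part of Terminal-Bench): print each tool that is
# not found on PATH, one per line, and return non-zero if any are missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# uv and Docker are the prerequisites named in this section.
if missing="$(missing_tools uv docker)"; then
    echo "uv and docker found"
else
    echo "missing: $(echo "$missing" | tr '\n' ' ')"
fi
```

This only checks that the executables exist; it does not verify that the Docker daemon is running.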

Install the package

Install Terminal-Bench with uv (recommended) or pip.

uv tool install terminal-bench

Install with pip (alternative)

If you prefer pip, install the same package directly.

pip install terminal-bench

See the harness options

The harness connects a model to a sandboxed terminal. View the available run options with the help flag.

tb run --help

Run against the leaderboard dataset

Evaluate an agent and model on Terminal-Bench-Core. Pass the dataset name and version to match the current leaderboard.

tb run \
    --agent terminus \
    --model anthropic/claude-3-7-latest \
    --dataset-name terminal-bench-core \
    --dataset-version 0.1.1 \
    --n-concurrent 8

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Benchmark how well an LLM agent completes real, end-to-end command-line tasks
Compare different agents or models on the same reproducible task suite
Stress-test an agent's system-level reasoning in a sandboxed shell before shipping
Submit results to the Terminal-Bench leaderboard or contribute new tasks and adapters
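
To compare several models on the same task suite, one option is to loop over model identifiers and reuse the tb run invocation from Getting started. The sketch below is an illustration under assumptions, not the project's own tooling: tb_run_cmd, MODELS, and DRY_RUN are hypothetical names, and only the flags already shown on this page are used. By default it prints each command instead of running it.

```shell
# Hypothetical helper: build (but do not run) a tb command line for one
# agent/model pair, using only the flags shown in the run example.
tb_run_cmd() {
    agent="$1"
    model="$2"
    printf 'tb run --agent %s --model %s --dataset-name terminal-bench-core --dataset-version 0.1.1 --n-concurrent 8\n' "$agent" "$model"
}

# Placeholder list: add the model identifiers you want to compare,
# separated by spaces. Only the first entry comes from this page.
MODELS="anthropic/claude-3-7-latest"

for model in $MODELS; do
    cmd="$(tb_run_cmd terminus "$model")"
    echo "$cmd"
    # Set DRY_RUN=0 to actually launch the run (needs tb, uv, and Docker).
    # Unquoted expansion is intentional: the arguments contain no spaces.
    if [ "${DRY_RUN:-1}" = "0" ]; then
        $cmd
    fi
done
```

Keeping the dataset name and version fixed across runs is what makes the resulting scores comparable.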

How Terminal-Bench compares

Terminal-Bench alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LM Evaluation Harness	★ 13k	EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass	★ 7.1k	An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench	★ 5.2k	A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals	★ 4.5k	OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval	★ 4.2k	An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench	★ 3.5k	A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM	★ 2.8k	Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
Terminal-Bench	★ 2.4k	A benchmark and harness for testing AI agents on real terminal tasks
