AgentBench

Benchmark LLMs as agents across operating systems, databases, web, and games

github.com/THUDM/AgentBench | ★ 3.5k | llmbench.ai/agent

Overview

AgentBench is a benchmark for measuring how well large language models behave as autonomous agents, not just as text generators. Instead of asking single-shot questions, it places a model in multi-turn interactive environments and scores whether it reaches the goal. The original release covers 8 environments: five built for the benchmark (Operating System, Database, Knowledge Graph, Digital Card Game, and Lateral Thinking Puzzles) and three recompiled from the existing datasets ALFWorld, WebShop, and Mind2Web.

The project is aimed at researchers and engineers who build or compare agentic LLMs and want a repeatable way to test tool use, planning, and grounded interaction. The current main branch is AgentBench FC (Function Calling), which uses a function-calling style prompt and is integrated with the AgentRL framework. It ships fully containerized deployment for five tasks: alfworld (AF), dbbench (DB), knowledgegraph (KG), os_interaction (OS), and webshop (WS).

As a benchmark harness in the evaluation category, AgentBench focuses on running agents against fixed task suites and reporting scores on a public leaderboard. Older versions (v0.1 and v0.2) remain available as tags if you need the earlier task set or the non-Docker conda workflow.

What it does

Evaluates LLMs as agents across diverse interactive environments rather than static QA
Original suite spans 8 environments: OS, DB, KG, Digital Card Game, Lateral Thinking Puzzles, plus ALFWorld, WebShop, and Mind2Web
AgentBench FC adds a function-calling prompt style and is integrated with the AgentRL framework
Fully containerized deployment for alfworld, dbbench, knowledgegraph, os_interaction, and webshop via Docker Compose
One-command setup brings up task workers, a controller, a Freebase server, and Redis for container allocation
Public leaderboard and Dev/Test splits for reporting and comparing model results

Getting started

AgentBench FC (the current main branch) runs its task environments in containers via Docker Compose. The steps below follow the README's one-command setup.

Get the code

Clone the repository and enter the directory.

git clone https://github.com/THUDM/AgentBench.git
cd AgentBench

Pull and build the task images

Some tasks need prebuilt Docker images. Pull MySQL for dbbench and build the OS interaction images.

# dbbench
docker pull mysql:8

# os_interaction
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
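
The three os_interaction builds above differ only in the image name, so they can be sketched as a loop. This is a convenience wrapper, not something the project ships: the function name build_os_images and the DRY_RUN switch are assumptions made for this example, while the image tags and Dockerfile paths are copied from the commands above.

```bash
# Directory holding the os_interaction Dockerfiles (taken from the commands above).
DOCKERFILE_DIR="data/os_interaction/res/dockerfiles"

# build_os_images: build local-os/default, local-os/packages, and local-os/ubuntu.
# DRY_RUN=1 is a hypothetical switch for this sketch; it prints each command
# instead of running it, which is handy for checking paths before a long build.
build_os_images() {
  local name
  for name in default packages ubuntu; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "docker build -t local-os/${name} -f ./${DOCKERFILE_DIR}/${name} ${DOCKERFILE_DIR}"
    else
      # Stop at the first failed build so later steps do not run on a partial set.
      docker build -t "local-os/${name}" -f "./${DOCKERFILE_DIR}/${name}" "${DOCKERFILE_DIR}" || return 1
    fi
  done
}
```

Run it from the repository root, for example with DRY_RUN=1 build_os_images to preview the commands, then build_os_images to perform the builds.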

Bring up the stack

Start the controller and task workers with Docker Compose. The webshop environment needs about 16 GB of RAM, so check that your machine has enough memory before starting.

docker compose -f extra/docker-compose.yml up
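
Because the webshop environment needs about 16 GB of RAM, a small preflight check before starting the stack may save a failed launch. The sketch below is an assumption-heavy helper, not part of AgentBench: the function names are hypothetical, it reads /proc/meminfo and therefore only works on Linux, and MemTotal usually reports slightly less than the installed RAM, so treat the result as a rough guide rather than a hard gate.

```bash
# Rough minimum, in GB, taken from the webshop memory note.
MIN_RAM_GB=16

# meets_min_ram TOTAL_KB MIN_GB: succeed if TOTAL_KB (kilobytes) is at least MIN_GB.
meets_min_ram() {
  local total_kb="$1" min_gb="$2"
  [ "$total_kb" -ge "$(( min_gb * 1024 * 1024 ))" ]
}

# total_ram_kb: print the MemTotal value from /proc/meminfo (Linux only).
total_ram_kb() {
  awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

# preflight: fail if docker is missing; only warn if RAM looks low.
preflight() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found on PATH" >&2
    return 1
  fi
  if ! meets_min_ram "$(total_ram_kb)" "$MIN_RAM_GB"; then
    echo "warning: less than ${MIN_RAM_GB} GB of RAM detected; webshop may not start" >&2
  fi
  return 0
}
```

One way to use it is preflight && docker compose -f extra/docker-compose.yml up, which starts the stack only when docker is available.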

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Compare how different LLMs perform as agents on the same fixed set of interactive tasks
Stress-test a model's tool use and multi-turn planning in OS, database, and web environments
Track agent capability over model versions and report results to the public leaderboard
Provide task environments for training and evaluating function-calling agents alongside AgentRL

How AgentBench compares

AgentBench alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LM Evaluation Harness	★ 13k	EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass	★ 7.1k	An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench	★ 5.2k	A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals	★ 4.5k	OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval	★ 4.2k	An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench	★ 3.5k	Benchmark LLMs as agents across operating systems, databases, web, and games
HELM	★ 2.8k	Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
LightEval	★ 2.5k	Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.
