AI/TLDR

AgentBench

Benchmark LLMs as agents across operating systems, databases, web, and games

Overview

AgentBench is a benchmark for measuring how well large language models behave as autonomous agents, not just as text generators. Instead of single-shot questions, it puts a model into multi-turn interactive environments and scores whether it can reach a goal. The original release covers 8 distinct environments: five newly built ones (Operating System, Database, Knowledge Graph, Digital Card Game, and Lateral Thinking Puzzles) and three recompiled from the existing datasets ALFWorld, WebShop, and Mind2Web.

The project is aimed at researchers and engineers who build or compare agentic LLMs and want a repeatable way to test tool use, planning, and grounded interaction. The current main branch is AgentBench FC (Function Calling), which uses a function-calling style prompt and is integrated with the AgentRL framework. It ships fully containerized deployment for five tasks: alfworld (AF), dbbench (DB), knowledgegraph (KG), os_interaction (OS), and webshop (WS).

As a benchmark harness in the evaluation category, AgentBench focuses on running agents against fixed task suites and reporting scores on a public leaderboard. Older versions (v0.1 and v0.2) remain available as tags if you need the earlier task set or the non-Docker conda workflow.

What it does

  • Evaluates LLMs as agents across diverse interactive environments rather than static QA
  • Original suite spans 8 environments: OS, DB, KG, Digital Card Game, Lateral Thinking Puzzles, plus ALFWorld, WebShop, and Mind2Web
  • AgentBench FC adds a function-calling prompt style and is integrated with the AgentRL framework
  • Fully containerized deployment for alfworld, dbbench, knowledgegraph, os_interaction, and webshop via Docker Compose
  • One-command setup brings up task workers, a controller, a Freebase server, and Redis for container allocation
  • Public leaderboard and Dev/Test splits for reporting and comparing model results

Getting started

AgentBench FC (the current main branch) runs its task environments in containers via Docker Compose. The steps below follow the README's one-command setup.

Get the code

Clone the repository and enter the directory.

git clone https://github.com/THUDM/AgentBench.git
cd AgentBench
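
If you need one of the older versions mentioned in the overview, a small helper like the one below can switch to a tag only after confirming it exists. This is a sketch, not part of the project: `checkout_tag` is a hypothetical helper name, and the tag names `v0.1` and `v0.2` are assumed to match the overview exactly, so run `git tag --list` to confirm them first.

```bash
# checkout_tag is a hypothetical helper, not something AgentBench ships.
# It switches the current repository to a tag only if that tag exists.
checkout_tag() {
  tag="$1"
  if git rev-parse -q --verify "refs/tags/$tag" >/dev/null; then
    # Checking out a tag leaves the repository in detached-HEAD state.
    git checkout -q "$tag"
  else
    echo "Tag '$tag' not found; run 'git tag --list' to see what exists" >&2
    return 1
  fi
}

# Example (run inside the cloned AgentBench directory):
#   checkout_tag v0.2
```

Return to the current function-calling version with `git checkout main`, assuming the default branch is named `main`.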

Pull and build the task images

Some tasks need prebuilt Docker images. Pull MySQL for dbbench and build the OS interaction images.

# dbbench
docker pull mysql:8

# os_interaction
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
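
The three os_interaction builds above differ only in the image name, so they can be collapsed into a loop. The sketch below assumes it is run from the repository root. `DRY_RUN` and `build_os_image` are hypothetical names introduced for illustration. By default the script only prints the commands, so you can compare them with the ones above before running anything.

```bash
# DRY_RUN is a hypothetical switch introduced for this sketch.
# It defaults to 1 (print only); set DRY_RUN=0 to run the builds.
DRY_RUN="${DRY_RUN:-1}"
DOCKERFILE_DIR="data/os_interaction/res/dockerfiles"

# Build (or print the build command for) one local-os image.
build_os_image() {
  name="$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "docker build -t local-os/$name -f ./$DOCKERFILE_DIR/$name $DOCKERFILE_DIR"
  else
    docker build -t "local-os/$name" -f "./$DOCKERFILE_DIR/$name" "$DOCKERFILE_DIR"
  fi
}

# Same three images as the individual commands, stopping at the first failure.
for name in default packages ubuntu; do
  build_os_image "$name" || break
done
```

Running it with `DRY_RUN=0` performs the same three builds as the individual commands.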

Bring up the stack

Start the controller and task workers with Docker Compose. The webshop environment needs about 16GB of RAM, so make sure your machine has enough resources.

docker compose -f extra/docker-compose.yml up
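
Before running the command above, it can help to check that the machine meets the roughly 16GB requirement quoted for webshop. The sketch below is an assumption-laden preflight check, not something the project ships: it is Linux-only because it reads `/proc/meminfo`, the function names are hypothetical, and the 15 GiB threshold is chosen here because the kernel reports slightly less memory than is physically installed.

```bash
# Threshold in kB. 15 GiB is an assumption made for this sketch so that a
# nominal 16GB machine still passes; adjust it to your own needs.
MIN_KB=$((15 * 1024 * 1024))

# Print MemTotal in kB from a meminfo-style file (default: /proc/meminfo).
mem_total_kb() {
  awk '/^MemTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

# Succeed only if the reported total memory is at least MIN_KB.
has_enough_memory() {
  total_kb=$(mem_total_kb "${1:-}")
  [ -n "$total_kb" ] && [ "$total_kb" -ge "$MIN_KB" ]
}

if has_enough_memory; then
  echo "Memory check passed"
else
  echo "Warning: under 15 GiB of RAM detected; webshop may not start" >&2
fi
```

The check looks only at total installed memory, not at what is currently free, so treat a pass as necessary rather than sufficient.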

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Compare how different LLMs perform as agents on the same fixed set of interactive tasks
  • Stress-test a model's tool use and multi-turn planning in OS, database, and web environments
  • Track agent capability over model versions and report results to the public leaderboard
  • Provide task environments for training and evaluating function-calling agents alongside AgentRL

How AgentBench compares

AgentBench alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests.
simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench | ★ 3.5k | Benchmark LLMs as agents across operating systems, databases, web, and games
HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.