Overview
AgentBench is a benchmark for measuring how well large language models behave as autonomous agents, not just as text generators. Instead of single-shot questions, it puts a model into multi-turn interactive environments and scores whether it can reach a goal. The original release covers 8 distinct environments: five newly built ones (Operating System, Database, Knowledge Graph, Digital Card Game, and Lateral Thinking Puzzles) and three recompiled from existing datasets (ALFWorld, WebShop, and Mind2Web).
The project is aimed at researchers and engineers who build or compare agentic LLMs and want a repeatable way to test tool use, planning, and grounded interaction. The current main branch is AgentBench FC (Function Calling), which uses a function-calling style prompt and is integrated with the AgentRL framework. It ships fully containerized deployment for five tasks: alfworld (AF), dbbench (DB), knowledgegraph (KG), os_interaction (OS), and webshop (WS).
As a benchmark harness in the evaluation category, AgentBench focuses on running agents against fixed task suites and reporting scores on a public leaderboard. Older versions (v0.1 and v0.2) remain available as tags if you need the earlier task set or the non-Docker conda workflow.
What it does
- Evaluates LLMs as agents across diverse interactive environments rather than static QA
- Original suite spans 8 environments: OS, DB, KG, Digital Card Game, Lateral Thinking Puzzles, plus ALFWorld, WebShop, and Mind2Web
- AgentBench FC adds a function-calling prompt style integrated with the AgentRL RL framework
- Fully containerized deployment for alfworld, dbbench, knowledgegraph, os_interaction, and webshop via Docker Compose
- One-command setup brings up task workers, a controller, a Freebase server, and Redis for container allocation
- Public leaderboard and Dev/Test splits for reporting and comparing model results
Getting started
AgentBench FC (the current main branch) runs its task environments in containers via Docker Compose. The steps below follow the README's one-command setup.
Get the code
Clone the repository and enter the directory.
git clone https://github.com/THUDM/AgentBench.git
cd AgentBench
Pull and build the task images
Some tasks need prebuilt Docker images. Pull MySQL for dbbench and build the OS interaction images.
# dbbench
docker pull mysql:8
# os_interaction
docker build -t local-os/default -f ./data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles
docker build -t local-os/packages -f ./data/os_interaction/res/dockerfiles/packages data/os_interaction/res/dockerfiles
docker build -t local-os/ubuntu -f ./data/os_interaction/res/dockerfiles/ubuntu data/os_interaction/res/dockerfiles
Bring up the stack
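Before starting the stack, it can help to confirm that the images from the pull and build step exist locally. The sketch below is not part of the AgentBench repository; `missing_images` is a hypothetical helper name, and the image list simply mirrors the commands shown earlier. It relies only on `docker image inspect`, which exits with a non-zero status when an image is not found.

```shell
# Hypothetical helper (not from the AgentBench repo): print each image
# name that is not present in the local Docker image store.
missing_images() {
  for img in "$@"; do
    # `docker image inspect` exits non-zero when the image is absent
    # (or when Docker itself is unavailable).
    if ! docker image inspect "$img" >/dev/null 2>&1; then
      printf '%s\n' "$img"
    fi
  done
}

# Image names taken from the pull and build commands in this guide.
missing=$(missing_images mysql:8 local-os/default local-os/packages local-os/ubuntu)
if [ -n "$missing" ]; then
  printf 'Missing images:\n%s\n' "$missing"
else
  echo "All task images are present."
fi
```

If any name is listed as missing, rerunning the corresponding `docker pull` or `docker build` command should resolve it.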
Start the controller and task workers with Docker Compose. The webshop environment needs about 16 GB of RAM, so make sure your machine has enough resources.
docker compose -f extra/docker-compose.yml up
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
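The memory note for webshop can be turned into a rough pre-flight check run before the compose command. This is a minimal sketch for Linux hosts only, not something the project ships: `has_enough_ram` is a hypothetical helper, the threshold of 16 is the approximate figure quoted above (read here as GiB), and the check reads `MemAvailable` from `/proc/meminfo`, which the kernel reports in kB.

```shell
# Hypothetical helper (not from the AgentBench repo).
# $1: memory in kB, $2: required amount in GiB.
# Succeeds when the memory figure meets or exceeds the requirement.
has_enough_ram() {
  [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# Linux only: read the kernel's estimate of memory available to new workloads.
if [ -r /proc/meminfo ]; then
  avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  if [ -n "$avail_kb" ]; then
    # 16 is an assumption based on the "about 16 GB" figure for webshop.
    if has_enough_ram "$avail_kb" 16; then
      echo "Memory check passed."
    else
      echo "Warning: under 16 GiB available; webshop may not start reliably."
    fi
  fi
fi
```

The check only warns and never blocks, since the requirement is an estimate and the other four tasks may run with less memory.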
When to use it
- Compare how different LLMs perform as agents on the same fixed set of interactive tasks
- Stress-test a model's tool use and multi-turn planning in OS, database, and web environments
- Track agent capability over model versions and report results to the public leaderboard
- Provide task environments for training and evaluating function-calling agents alongside AgentRL
How AgentBench compares
AgentBench alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark and containerized harness that tests whether language models can resolve real GitHub issues by generating patches that pass a repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across operating systems, databases, web, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |