Overview
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a model is asked to generate a patch that resolves the described problem, and the harness checks whether that patch makes the repository's tests pass.
It is built for researchers and engineers who work on coding agents and want a reproducible way to measure how well a model fixes real bugs. Because evaluation runs inside Docker containers, results are consistent across machines rather than depending on a local Python setup.
SWE-bench provides both the datasets (including SWE-bench Lite, SWE-bench Verified, and SWE-bench Multimodal) and the runner that grades patch predictions, so you can compare models on the same task set.
What it does
- Tasks built from real GitHub issues, where success means generating a patch that passes the repository's existing tests
- Containerized evaluation with Docker for reproducible runs across machines
- Multiple dataset variants: full SWE-bench, SWE-bench Lite, SWE-bench Verified (500 human-confirmed solvable problems), and SWE-bench Multimodal
- Local, parallel evaluation via a configurable number of workers
- Cloud-based evaluation options through Modal and the sb-cli tool
- Datasets loadable directly from Hugging Face for inference and training
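As a rough illustration of the last point, the sketch below loads one variant through the Hugging Face `datasets` package. Only `princeton-nlp/SWE-bench_Lite` is named on this page; the other Hub IDs, the `test` split, the shorthand keys, and the field names `instance_id`, `repo`, and `problem_statement` are assumptions to check against the dataset cards.

```python
# Hub IDs for the variants listed above. Only the Lite ID appears on this page;
# the others are assumed to follow the same naming pattern.
DATASET_IDS = {
    "full": "princeton-nlp/SWE-bench",
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",
    "multimodal": "princeton-nlp/SWE-bench_Multimodal",
}


def resolve_dataset_id(variant: str) -> str:
    """Map an illustrative shorthand such as 'lite' to a Hub dataset ID."""
    key = variant.lower()
    if key not in DATASET_IDS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(DATASET_IDS)}")
    return DATASET_IDS[key]


def load_tasks(variant: str = "lite", split: str = "test"):
    """Download (or reuse the local cache of) one variant's task instances."""
    # Imported lazily: `datasets` is a third-party package (pip install datasets)
    # and this call needs network access on first use.
    from datasets import load_dataset

    return load_dataset(resolve_dataset_id(variant), split=split)


def summarize(task: dict) -> str:
    """One-line summary of a task, using field names assumed from the dataset card."""
    lines = task["problem_statement"].strip().splitlines()
    first_line = lines[0] if lines else ""
    return f"{task['instance_id']} ({task['repo']}): {first_line}"
```

Calling `summarize(load_tasks("lite")[0])` would then print the first Lite task's ID, repository, and the opening line of its issue text.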
Getting started
SWE-bench uses Docker for reproducible evaluations, so install Docker first, then build the package from source and run a quick gold-patch check to confirm your setup.
Install Docker
Follow Docker's official install guide for your OS. On Linux, also complete the post-installation steps. Evaluation is resource intensive: the docs recommend an x86_64 machine with at least 120GB free storage, 16GB RAM, and 8 CPU cores.
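Before pulling images, it can help to compare the host against those recommendations. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the thresholds are the ones quoted above, and the RAM probe relies on `os.sysconf`, which is not available on every platform.

```python
import os
import platform
import shutil
from typing import List, Optional

# Recommended minimums quoted in the text above.
MIN_FREE_DISK_GB = 120
MIN_RAM_GB = 16
MIN_CPU_CORES = 8


def check_host(machine: str, free_disk_gb: float,
               ram_gb: Optional[float], cpu_cores: int) -> List[str]:
    """Return human-readable warnings; an empty list means the host looks fine."""
    warnings = []
    if machine.lower() not in ("x86_64", "amd64"):
        warnings.append(f"architecture is {machine}, not x86_64")
    if free_disk_gb < MIN_FREE_DISK_GB:
        warnings.append(f"only {free_disk_gb:.0f} GB free disk, want {MIN_FREE_DISK_GB}")
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        warnings.append(f"only {ram_gb:.0f} GB RAM, want {MIN_RAM_GB}")
    if cpu_cores < MIN_CPU_CORES:
        warnings.append(f"only {cpu_cores} CPU cores, want {MIN_CPU_CORES}")
    return warnings


def probe_host(path: str = ".") -> List[str]:
    """Measure the current machine and run the checks against it."""
    free_gb = shutil.disk_usage(path).free / 1e9  # decimal gigabytes
    try:
        ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (AttributeError, ValueError, OSError):
        ram_gb = None  # RAM size not discoverable this way (e.g. on Windows)
    return check_host(platform.machine(), free_gb, ram_gb, os.cpu_count() or 1)
```

`probe_host()` checks the disk that holds the current directory, which may differ from where Docker stores its images, so pass the Docker data path if it lives elsewhere.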
Install SWE-bench from source
Clone the repository and install it in editable mode with pip.
git clone git@github.com:princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
Verify the installation
Run the harness against a single instance using the gold patch. On Apple Silicon or other ARM systems, add --namespace '' to build images locally.
python -m swebench.harness.run_evaluation \
--predictions_path gold \
--max_workers 1 \
--instance_ids sympy__sympy-20590 \
--run_id validate-gold
Evaluate your own predictions
Point the harness at your predictions file and pick a dataset such as SWE-bench Lite. Results are written to the evaluation_results directory, with build and run logs under logs/.
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
--run_id <run_id>
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
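The page does not show what a predictions file contains. As far as I understand the project's documentation, each prediction carries an `instance_id`, a `model_name_or_path`, and a `model_patch` (a unified diff), stored as JSON or JSON Lines; treat those key names as an assumption to confirm in the repo. The sketch below writes such a file and assembles the command shown above, appending the `--namespace ''` flag on ARM hosts as the verification step suggests. The helper names and the default of 4 workers are hypothetical.

```python
import json
import platform
import sys
from typing import Dict, List, Optional


def write_predictions(path: str, patches: Dict[str, str], model_name: str) -> int:
    """Write one JSON object per line; return the number of predictions written."""
    with open(path, "w", encoding="utf-8") as handle:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,        # task ID from the dataset
                "model_name_or_path": model_name,  # label for this model (assumed key)
                "model_patch": patch,              # unified diff text (assumed key)
            }
            handle.write(json.dumps(record) + "\n")
    return len(patches)


def evaluation_command(predictions_path: str, run_id: str,
                       dataset_name: str = "princeton-nlp/SWE-bench_Lite",
                       max_workers: int = 4,
                       machine: Optional[str] = None) -> List[str]:
    """Build the argv list for the harness, using only flags shown on this page."""
    cmd = [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--predictions_path", predictions_path,
        "--max_workers", str(max_workers),
        "--run_id", run_id,
    ]
    arch = (machine or platform.machine()).lower()
    if arch in ("arm64", "aarch64"):
        # Per the ARM note in the verification step: build images locally.
        cmd += ["--namespace", ""]
    return cmd
```

Passing the returned list to `subprocess.run(..., check=True)` launches the evaluation; building the command as a list avoids shell-quoting problems with the empty namespace value.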
When to use it
- Measuring how well a new code-fixing model or agent resolves real GitHub issues against a standard task set
- Comparing model variants on SWE-bench Lite or Verified to track progress between training runs
- Validating gold patches or your own predictions in a reproducible Docker environment before reporting results
- Running large evaluations on the cloud via Modal or sb-cli when local hardware is limited
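For the comparison use case, a resolve rate alone can hide regressions, so it is worth looking at which tasks changed between runs. The sketch below assumes you have already extracted the set of resolved instance IDs from each run's result files; how those files are laid out is not covered here, and `compare_runs` is a hypothetical helper, not part of SWE-bench.

```python
from typing import Dict, Iterable


def compare_runs(resolved_a: Iterable[str], resolved_b: Iterable[str],
                 total_tasks: int) -> Dict[str, object]:
    """Compare two runs on the same task set by their resolved instance IDs."""
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    a, b = set(resolved_a), set(resolved_b)
    return {
        "rate_a": len(a) / total_tasks,  # fraction of tasks run A resolved
        "rate_b": len(b) / total_tasks,  # fraction of tasks run B resolved
        "gained": sorted(b - a),         # resolved by B but not by A
        "lost": sorted(a - b),           # resolved by A but not by B
    }
```

A run that gains three tasks and loses two shows the same net change as one that gains a single task, which is why the `gained` and `lost` lists are reported alongside the rates.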
How SWE-bench compares
SWE-bench alongside the other open-source benchmark harnesses that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards. |
| OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support. |
| SWE-bench | ★ 5.2k | A benchmark that tests whether language models can resolve real GitHub issues by generating patches that pass the repository's tests. |
| simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy. |
| lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface. |
| AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games. |
| HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics. |
| LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions. |