SWE-bench

Test whether language models can resolve real GitHub issues

github.com/SWE-bench/SWE-bench · ★ 5.2k · swebench.com

Overview

SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a model is asked to generate a patch that resolves the described problem, and the harness checks whether that patch makes the repository's tests pass.

It is built for researchers and engineers who work on coding agents and want a reproducible way to measure how well a model fixes real bugs. Because evaluation runs inside Docker containers, results are consistent across machines rather than depending on a local Python setup.

As a benchmark harness in the evaluation category, SWE-bench provides both the datasets (including SWE-bench Lite, SWE-bench Verified, and SWE-bench Multimodal) and the runner that grades patch predictions, so you can compare models on the same task set.
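The pass/fail check described above can be pictured with a small sketch. It assumes each task lists tests that must go from failing to passing and tests that must keep passing; the names `fail_to_pass` and `pass_to_pass` mirror fields in the upstream dataset as far as I know, but this page does not define them, and the function is a simplified illustration rather than the harness's own logic.

```python
def is_resolved(fail_to_pass, pass_to_pass, passed_after_patch):
    """Return True when a patch counts as resolving the issue.

    fail_to_pass:       tests that failed before the fix and must now pass
    pass_to_pass:       tests that passed before and must not regress
    passed_after_patch: tests observed to pass once the patch is applied
    """
    passed = set(passed_after_patch)
    return set(fail_to_pass) <= passed and set(pass_to_pass) <= passed


if __name__ == "__main__":
    # Hypothetical test names, for illustration only.
    print(is_resolved(["test_bug"], ["test_old"], ["test_bug", "test_old"]))
```

A patch that fixes the target test but breaks a previously passing one would not count as resolved under this reading.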

What it does

Tasks built from real GitHub issues, where success means generating a patch that passes the repository's existing tests
Containerized evaluation with Docker for reproducible runs across machines
Multiple dataset variants: full SWE-bench, SWE-bench Lite, SWE-bench Verified (500 human-confirmed solvable problems), and SWE-bench Multimodal
Local, parallel evaluation via a configurable number of workers
Cloud-based evaluation options through Modal and the sb-cli tool
Datasets loadable directly from Hugging Face for inference and training
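For the configurable worker count mentioned in the list above, a rough way to pick a value is sketched here. The cap of min(0.75 × CPU count, 24) is a rule of thumb I recall from the upstream README, not something stated on this page, so treat it as an assumption and check the current docs. The helper name `suggested_max_workers` is hypothetical.

```python
import os


def suggested_max_workers(cpus=None, fraction=0.75, hard_cap=24):
    """Suggest a --max_workers value from the CPU count.

    Assumed rule of thumb: stay at or below min(fraction * cpus, hard_cap),
    and never go below one worker.
    """
    cpus = cpus or os.cpu_count() or 1
    return max(1, min(int(fraction * cpus), hard_cap))


if __name__ == "__main__":
    print("suggested --max_workers:", suggested_max_workers())
```

Lower the value further if containers compete for disk or memory on your machine.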

Getting started

SWE-bench uses Docker for reproducible evaluations. Install Docker first, then install the package from source and run a quick gold-patch check to confirm your setup.

Install Docker

Follow Docker's official install guide for your OS. On Linux, also complete the post-installation steps. Evaluation is resource intensive: the docs recommend an x86_64 machine with at least 120GB free storage, 16GB RAM, and 8 CPU cores.
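Before pulling images, it can help to compare the machine against the recommended minimums quoted above. This is a minimal sketch using only the Python standard library; the function name `check_host` is hypothetical, it measures free space on the current directory's filesystem (which may differ from where Docker stores images), and it does not check RAM because the standard library has no portable call for that.

```python
import os
import platform
import shutil


def check_host(path=".", min_free_gb=120, min_cpus=8):
    """Compare this machine against the recommended minimums."""
    free_gb = shutil.disk_usage(path).free / 1e9  # decimal gigabytes
    cpus = os.cpu_count() or 1
    arch = platform.machine().lower()
    return {
        "arch": arch,
        "is_x86_64": arch in ("x86_64", "amd64"),
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
        "cpus": cpus,
        "enough_cpus": cpus >= min_cpus,
    }


if __name__ == "__main__":
    for key, value in check_host().items():
        print(f"{key}: {value}")
```

A False value is a warning, not a hard stop: smaller runs, such as a single instance, may still work on a machine below these numbers.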

Install SWE-bench from source

Clone the repository and install it in editable mode with pip.

git clone git@github.com:princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .

Verify the installation

Run the harness against a single instance using the gold patch, the reference fix for that issue. On Apple Silicon or other ARM systems, add --namespace '' so the harness builds images locally instead of pulling prebuilt ones.

python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --run_id validate-gold
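If you script this check, the ARM case can be handled automatically. The sketch below only assembles the argument list from the flags shown above; the helper name `gold_check_command` and the set of architecture strings treated as ARM are assumptions.

```python
import platform
import shlex
import sys


def gold_check_command(instance_id, run_id, machine=None):
    """Build the gold-patch validation command as an argv list."""
    machine = (machine or platform.machine()).lower()
    cmd = [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--predictions_path", "gold",
        "--max_workers", "1",
        "--instance_ids", instance_id,
        "--run_id", run_id,
    ]
    # On ARM hosts, pass an empty namespace so images are built locally.
    if machine in ("arm64", "aarch64"):
        cmd += ["--namespace", ""]
    return cmd


if __name__ == "__main__":
    cmd = gold_check_command("sympy__sympy-20590", "validate-gold")
    print(shlex.join(cmd))  # inspect first, then run with subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell-quoting problems with the empty `--namespace` value.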

Evaluate your own predictions

Point the harness at your predictions file and pick a dataset such as SWE-bench Lite. Results are written to the evaluation_results directory, with build and run logs under logs/.

python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Measuring how well a new code-fixing model or agent resolves real GitHub issues against a standard task set
Comparing model variants on SWE-bench Lite or Verified to track progress between training runs
Validating gold patches or your own predictions in a reproducible Docker environment before reporting results
Running large evaluations on the cloud via Modal or sb-cli when local hardware is limited

How SWE-bench compares

SWE-bench alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
LM Evaluation Harness	★ 13k	EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass	★ 7.1k	An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench	★ 5.2k	Test whether language models can resolve real GitHub issues
simple-evals	★ 4.5k	OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval	★ 4.2k	An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench	★ 3.5k	A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM	★ 2.8k	Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
LightEval	★ 2.5k	Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.
