AI/TLDR

SWE-bench

Test whether language models can resolve real GitHub issues

Overview

SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a model is asked to generate a patch that resolves the described problem, and the harness checks whether that patch makes the repository's tests pass.

It is built for researchers and engineers who work on coding agents and want a reproducible way to measure how well a model fixes real bugs. Because evaluation runs inside Docker containers, results are consistent across machines rather than depending on a local Python setup.

SWE-bench provides both the datasets (the full SWE-bench set plus SWE-bench Lite, SWE-bench Verified, and SWE-bench Multimodal) and the harness that grades patch predictions, so you can compare models on the same task set.

What it does

  • Tasks built from real GitHub issues, where success means generating a patch that passes the repository's existing tests
  • Containerized evaluation with Docker for reproducible runs across machines
  • Multiple dataset variants: full SWE-bench, SWE-bench Lite, SWE-bench Verified (500 human-confirmed solvable problems), and SWE-bench Multimodal
  • Local, parallel evaluation via a configurable number of workers
  • Cloud-based evaluation options through Modal and the sb-cli tool
  • Datasets loadable directly from Hugging Face for inference and training
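
As a rough illustration of loading the variants listed above, the sketch below maps short names to Hugging Face dataset IDs and loads one with the `datasets` library. The short names and the `load_variant` helper are hypothetical conveniences, not part of SWE-bench, and the IDs are assumed to be the ones published under the `princeton-nlp` organization; check the official repo before relying on them.

```python
# Hugging Face dataset IDs for the variants named above.
# Assumption: these are the IDs published under princeton-nlp; verify them.
DATASET_IDS = {
    "full": "princeton-nlp/SWE-bench",
    "lite": "princeton-nlp/SWE-bench_Lite",
    "verified": "princeton-nlp/SWE-bench_Verified",
    "multimodal": "princeton-nlp/SWE-bench_Multimodal",
}


def load_variant(variant: str, split: str = "test"):
    """Download one SWE-bench variant from the Hugging Face Hub (hypothetical helper)."""
    if variant not in DATASET_IDS:
        raise KeyError(f"unknown variant {variant!r}; choose from {sorted(DATASET_IDS)}")
    # Imported lazily so the mapping above is usable without the package installed.
    from datasets import load_dataset

    return load_dataset(DATASET_IDS[variant], split=split)
```

Calling `load_variant("lite")` needs network access and the `datasets` package. Each returned row should describe one task instance.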

Getting started

SWE-bench uses Docker for reproducible evaluations, so install Docker first, then install the package from source and run a quick gold-patch check to confirm your setup.

Install Docker

Follow Docker's official install guide for your OS. On Linux, also complete the post-installation steps. Evaluation is resource-intensive: the docs recommend an x86_64 machine with at least 120 GB of free storage, 16 GB of RAM, and 8 CPU cores.
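
The recommended machine profile can be checked before starting a long run. This is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of SWE-bench, and the thresholds are simply the figures quoted above. It measures free space for a path you pass in, which may differ from where Docker stores its images.

```python
import os
import platform
import shutil

# Thresholds taken from the recommendation quoted above.
MIN_FREE_GB = 120
MIN_RAM_GB = 16
MIN_CORES = 8


def check_requirements(free_gb, ram_gb, cores, arch):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if arch.lower() not in ("x86_64", "amd64"):
        problems.append(f"architecture is {arch}, not x86_64")
    if free_gb < MIN_FREE_GB:
        problems.append(f"only {free_gb:.0f} GB free, want {MIN_FREE_GB} GB")
    # ram_gb is None when the platform does not expose physical memory size.
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        problems.append(f"only {ram_gb:.0f} GB RAM, want {MIN_RAM_GB} GB")
    if cores < MIN_CORES:
        problems.append(f"only {cores} CPU cores, want {MIN_CORES}")
    return problems


def probe_local_machine(path="."):
    """Collect (free_gb, ram_gb, cores, arch) for the machine running this script."""
    free_gb = shutil.disk_usage(path).free / 1e9
    try:
        ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (AttributeError, ValueError, OSError):
        ram_gb = None  # os.sysconf is unavailable or lacks these keys here
    return free_gb, ram_gb, os.cpu_count() or 1, platform.machine()
```

Running `check_requirements(*probe_local_machine())` returns the list of shortfalls for the current machine.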

Install SWE-bench from source

Clone the repository and install it in editable mode with pip.

bash
git clone git@github.com:princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .

Verify the installation

Run the harness against a single instance using the gold patch. On Apple Silicon or other ARM systems, add --namespace '' to build images locally.

bash
python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 1 \
    --instance_ids sympy__sympy-20590 \
    --run_id validate-gold
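
If the same check has to run on both x86_64 and ARM machines, the command can be assembled in a small script. The sketch below is an illustration only: `build_gold_check` is a hypothetical helper, and it uses only the flags shown above, appending the empty `--namespace` when the machine reports an ARM architecture.

```python
import platform
import shlex


def build_gold_check(instance_id, run_id="validate-gold", machine=None):
    """Return the gold-patch check command as an argument list (hypothetical helper)."""
    machine = (machine or platform.machine()).lower()
    cmd = [
        "python", "-m", "swebench.harness.run_evaluation",
        "--predictions_path", "gold",
        "--max_workers", "1",
        "--instance_ids", instance_id,
        "--run_id", run_id,
    ]
    if machine in ("arm64", "aarch64"):
        # Empty namespace so images are built locally, per the note above.
        cmd += ["--namespace", ""]
    return cmd


# Print a shell-ready command for an ARM machine.
print(shlex.join(build_gold_check("sympy__sympy-20590", machine="arm64")))
```

The argument list can be passed directly to `subprocess.run`, which avoids quoting problems with the empty namespace value.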

Evaluate your own predictions

Point the harness at your predictions file and pick a dataset such as SWE-bench Lite. Results are written to the evaluation_results directory, with build and run logs under logs/.

bash
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.
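
The predictions file passed to --predictions_path has to be produced by your own inference code. The sketch below is not taken from the project; it assumes a JSON Lines layout with one object per task carrying `instance_id`, `model_name_or_path`, and `model_patch` keys, which should be confirmed against the official repo. The `write_predictions` helper and its arguments are hypothetical.

```python
import json
from pathlib import Path


def write_predictions(path, model_name, patches):
    """Write one JSON object per line: instance ID, model name, and patch text.

    `patches` maps an instance ID to the unified diff the model produced.
    Assumption: the harness accepts this JSONL layout; verify before use.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            fh.write(json.dumps(record) + "\n")
    return path
```

For example, `write_predictions("preds.jsonl", "my-model", {"sympy__sympy-20590": diff_text})` would write a single-line file whose path can then be supplied as the predictions path.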

When to use it

  • Measuring how well a new code-fixing model or agent resolves real GitHub issues against a standard task set
  • Comparing model variants on SWE-bench Lite or Verified to track progress between training runs
  • Validating gold patches or your own predictions in a reproducible Docker environment before reporting results
  • Running large evaluations on the cloud via Modal or sb-cli when local hardware is limited
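
When comparing two runs on the same task set, the useful numbers are the share of instances resolved and which instances changed status. The sketch below assumes you have already extracted the resolved instance IDs from each run's results; how to read them from the harness output is not covered here, and both function names are hypothetical.

```python
def resolve_rate(resolved_ids, total_instances):
    """Fraction of the task set that a run resolved."""
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    return len(set(resolved_ids)) / total_instances


def compare_runs(baseline_resolved, candidate_resolved):
    """List instances newly resolved or newly broken relative to a baseline run."""
    baseline = set(baseline_resolved)
    candidate = set(candidate_resolved)
    return {
        "gained": sorted(candidate - baseline),  # resolved now, not before
        "lost": sorted(baseline - candidate),    # resolved before, not now
    }
```

Tracking the "lost" list alongside the headline rate helps catch regressions that a flat or rising percentage would hide.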

How SWE-bench compares

SWE-bench alongside the other open-source benchmark harness tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
--- | --- | ---
LM Evaluation Harness | ★ 13k | EleutherAI's framework for few-shot evaluation of language models across 60+ academic benchmarks, used as the backend for many leaderboards.
OpenCompass | ★ 7.1k | An LLM evaluation platform that runs models against 100+ datasets covering reasoning, knowledge, coding, and domain tasks, with leaderboards and multi-model support.
SWE-bench | ★ 5.2k | Test whether language models can resolve real GitHub issues
simple-evals | ★ 4.5k | OpenAI's lightweight library for running standard zero-shot, chain-of-thought benchmarks like MMLU, MATH, and GPQA to measure model accuracy.
lmms-eval | ★ 4.2k | An evaluation suite for large multimodal models that runs image, video, and audio benchmarks across many tasks with a unified, reproducible interface.
AgentBench | ★ 3.5k | A benchmark that evaluates LLMs as agents across diverse interactive environments such as operating systems, databases, web browsing, and games.
HELM | ★ 2.8k | Stanford CRFM's Holistic Evaluation of Language Models framework for reproducible, transparent benchmarking of foundation and multimodal models across many scenarios and metrics.
LightEval | ★ 2.5k | Hugging Face's toolkit for evaluating LLMs on standard benchmarks across multiple inference backends, with custom task and metric definitions.