OpenAI · 2026-06-30 · major

GeneBench-Pro — OpenAI's 129-problem computational-biology benchmark

GeneBench-Pro is a 129-problem benchmark from OpenAI that grades AI agents on messy, judgment-heavy computational biology. GPT-5.6 Sol Pro tops it at 31.5%; Claude Opus 4.8 is the strongest non-OpenAI model at 16.0%.

GeneBench-Pro public case studies dataset card on Hugging Face

A research-level bio benchmark where top frontier models still fail more than two thirds of the time.

Quick facts

Maker	OpenAI
Tasks	129 problems across 10 domains, 21 sub-domains
Format	Agent gets data files, a Python + PLINK 2.0 workspace, a target estimand
Top score	GPT-5.6 Sol Pro (31.5%)
Best non-OpenAI model	Claude Opus 4.8 (16.0%)
Public package	10-question subset on Hugging Face, CC-BY-4.0

Benchmarks

GeneBench-Pro (pass rate)

GPT-5.6 Sol Pro		31.5%
GPT-5.6 Sol (high reasoning)		28.7%
Claude Opus 4.8		16.0%

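The reported pass rates can be turned into failure rates and gaps with a few lines of arithmetic. The sketch below uses only the three scores listed above; the dictionary layout and variable names are illustrative and not part of any OpenAI tooling.

```python
# Pass rates (percent) as reported in the table above.
scores = {
    "GPT-5.6 Sol Pro": 31.5,
    "GPT-5.6 Sol (high reasoning)": 28.7,
    "Claude Opus 4.8": 16.0,
}

# Failure rate is simply the complement of the pass rate.
failure_rates = {model: round(100 - s, 1) for model, s in scores.items()}

# Gap in percentage points between the top model and the best non-OpenAI model.
gap_points = round(scores["GPT-5.6 Sol Pro"] - scores["Claude Opus 4.8"], 1)

# Check the claim that every listed model fails more than two thirds of the time.
two_thirds = 100 * 2 / 3
all_fail_over_two_thirds = all(f > two_thirds for f in failure_rates.values())

for model, f in failure_rates.items():
    print(f"{model}: fails {f}% of problems")
print(f"Gap, top model vs. Claude Opus 4.8: {gap_points} points")
print(f"All fail more than two thirds: {all_fail_over_two_thirds}")
```

Even the top model's failure rate of 68.5% sits above the two-thirds mark, which is what the summary line at the top of the page refers to.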

What is it?

GeneBench-Pro is a benchmark from OpenAI that hands an AI agent 129 synthetic-but-realistic problems from genomics, quantitative biology and translational medicine. Each task pairs a noisy dataset with a target estimand tied to a downstream decision.

How does it work?

GeneBench-Pro drops the agent into an isolated workspace with data files, a short experimental context, and a standard bioinformatics stack (Python, PLINK 2.0). Every problem is built from a known data-generating process, so answers are graded against ground truth and ablations verify that plausible-but-wrong analyses fail.
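To make the grading idea concrete, here is a toy sketch of a problem built from a known data-generating process, where a plausible-but-wrong analysis fails and a correct one passes. Everything in it is an assumption for illustration: the confounded linear model, the 10% relative tolerance, and the function names `simulate`, `ols_slope` and `grade` are hypothetical and are not taken from the GeneBench-Pro grader.

```python
import numpy as np

def simulate(n=20000, true_effect=0.5, seed=0):
    """Hypothetical data-generating process: a confounder drives both
    the exposure and the outcome, so a naive regression is biased."""
    rng = np.random.default_rng(seed)
    confounder = rng.normal(size=n)
    exposure = 0.8 * confounder + rng.normal(size=n)
    outcome = true_effect * exposure + 1.5 * confounder + rng.normal(size=n)
    return exposure, confounder, outcome

def ols_slope(outcome, *covariates):
    """Least-squares coefficient on the first covariate (intercept included)."""
    design = np.column_stack([np.ones_like(outcome), *covariates])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[1]

def grade(estimate, truth, rel_tol=0.10):
    """Assumed grading rule: pass if within a relative tolerance of the truth."""
    return abs(estimate - truth) <= rel_tol * abs(truth)

TRUE_EFFECT = 0.5  # known because we wrote the generator ourselves
exposure, confounder, outcome = simulate(true_effect=TRUE_EFFECT)

naive = ols_slope(outcome, exposure)                 # ignores the confounder
adjusted = ols_slope(outcome, exposure, confounder)  # adjusts for it

print(f"naive estimate:    {naive:.3f} -> pass={grade(naive, TRUE_EFFECT)}")
print(f"adjusted estimate: {adjusted:.3f} -> pass={grade(adjusted, TRUE_EFFECT)}")
```

Because the generator is known, the true effect is available for grading, and the naive regression plays the role of an ablation: it looks reasonable but lands far from the truth, so a grader of this kind rejects it.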

Why does it matter?

Coding evals are saturating, while GeneBench-Pro still leaves wide headroom: the best model solves under a third of the problems. In addition, 82 of the 129 tasks were vetted by graduate students, postdocs, industry scientists and university professors, so scores reflect real research judgment rather than exam-style trivia.

Who is it for?

AI evaluation researchers and bio/AI labs

Frequently asked questions

What does GeneBench-Pro measure?

GeneBench-Pro measures whether an AI agent can carry a real computational-biology analysis end to end: reading a noisy dataset, choosing the right method, revising when early diagnostics flag a problem, and producing a defensible answer. It targets research judgment, not just knowledge recall.

How did the top models score on GeneBench-Pro?

OpenAI reports GPT-5.6 Sol Pro at 31.5% and GPT-5.6 Sol at 28.7% at its highest reasoning setting. Claude Opus 4.8 is the strongest non-OpenAI model at 16.0%. Open-weight GLM 5.2 trails further, by a wider margin than it does on coding benchmarks.

Is GeneBench-Pro open source?

OpenAI published a 10-question public case-study package on Hugging Face under CC-BY-4.0, with staged data files, ground-truth answers, and a Python grader. The full 129-problem benchmark is not open; a 50-question subset goes to Artificial Analysis for independent third-party evaluation.

How is GeneBench-Pro different from the original GeneBench?

The original GeneBench topped out around 5% for GPT-5 when OpenAI built it, and later models began to saturate it; GeneBench-Pro adds harder, noisier problems that stress an agent's iterative choices. OpenAI also had 82 of the 129 tasks reviewed by outside biologists to confirm they reflect real research.

Try it

huggingface.co/datasets/ajh-oai/genebench-pro-public-package
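One possible way to pull the public package locally is through the `huggingface_hub` library. The sketch below is untested against this particular repository: it assumes the dataset is publicly downloadable under the repo ID implied by the URL above, and the helper names `repo_id_from_url` and `fetch_public_package` are hypothetical. It makes no assumption about file names inside the package and only lists whatever is downloaded.

```python
from pathlib import Path

DATASET_URL = "huggingface.co/datasets/ajh-oai/genebench-pro-public-package"

def repo_id_from_url(url: str) -> str:
    """Extract the '<owner>/<name>' repo ID from a Hugging Face dataset URL."""
    _, _, tail = url.partition("/datasets/")
    if not tail:
        raise ValueError(f"not a dataset URL: {url!r}")
    return tail.strip("/")

REPO_ID = repo_id_from_url(DATASET_URL)

def fetch_public_package(repo_id: str = REPO_ID) -> list[str]:
    """Download the dataset snapshot and return the relative paths of its files.

    Requires network access and `pip install huggingface_hub`; the import is
    kept inside the function so the rest of this module loads without it.
    """
    from huggingface_hub import snapshot_download

    local_dir = Path(snapshot_download(repo_id=repo_id, repo_type="dataset"))
    return sorted(
        str(p.relative_to(local_dir)) for p in local_dir.rglob("*") if p.is_file()
    )
```

Calling `fetch_public_package()` should return a sorted list of file paths, which is a reasonable starting point for locating the staged data, ground-truth answers and grader mentioned in the FAQ.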