What is LifeSciBench?

LifeSciBench is OpenAI's 750-task evaluation that grades how well AI models handle real life-science research. Each task uses an expert-written rubric instead of multiple choice, and the benchmark spans seven biological domains and seven research workflows.

How well do current models do on LifeSciBench?

The strongest model on LifeSciBench is GPT-Rosalind, with a 36.1% task pass rate (rubric score 0.576). GPT-5.5 reaches 25.7%, Gemini 3.1 Pro 23.6%, GPT-5.4 20.7%, and Grok 4.3 13.0%. OpenAI says the benchmark is far from saturated.

Why is LifeSciBench harder than older biology benchmarks?

LifeSciBench uses free-response tasks graded by 19,020 rubric criteria written by 173 PhD scientists, not multiple-choice answers. About 53% of tasks require models to interpret attached artifacts like figures, PDFs, sequence files, and chemical structures.

What's the biggest weakness LifeSciBench exposed?

Artifact processing is the main bottleneck. GPT-Rosalind's pass rate drops 17 percentage points, from 45.1% on text-only tasks to 28.1% on tasks with attached data files, which indicates that current models struggle with scientific figures and structured biology files.
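The gap quoted above can be checked with a few lines of arithmetic. The sketch below uses only figures from this page; the function names are illustrative, and the blended estimate assumes the "about 53%" artifact share is exact and that the text-only and artifact subsets together cover all 750 tasks.

```python
# Figures quoted on this page (GPT-Rosalind, task pass rate in percent).
TEXT_ONLY_PASS = 45.1
ARTIFACT_PASS = 28.1
OVERALL_PASS = 36.1
ARTIFACT_SHARE = 0.53  # "about 53%" of tasks, so treat this as approximate


def point_gap(text_only, artifact):
    """Absolute drop in percentage points."""
    return round(text_only - artifact, 1)


def relative_drop(text_only, artifact):
    """Drop expressed as a fraction of the text-only pass rate."""
    return (text_only - artifact) / text_only


def blended_pass_rate(text_only, artifact, artifact_share):
    """Overall pass rate implied by the two subsets, weighted by task share."""
    return (1 - artifact_share) * text_only + artifact_share * artifact


gap = point_gap(TEXT_ONLY_PASS, ARTIFACT_PASS)
rel = relative_drop(TEXT_ONLY_PASS, ARTIFACT_PASS)
blended = blended_pass_rate(TEXT_ONLY_PASS, ARTIFACT_PASS, ARTIFACT_SHARE)

print(f"gap: {gap} points")
print(f"relative drop: {rel:.1%}")
print(f"implied overall pass rate: {blended:.1f}% (reported: {OVERALL_PASS}%)")
```

Under those assumptions the 17-point gap is a relative drop of roughly 38%, and the weighted blend of the two subsets comes out close to the reported 36.1% overall pass rate, so the quoted figures are mutually consistent.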

OpenAI · 2026-06-17 · major

LifeSciBench — OpenAI's 750-task benchmark for life-science research

OpenAI's LifeSciBench grades AI models on 750 expert-authored life-science tasks using rubrics. The strongest model, GPT-Rosalind, passes only 36.1%, with attached data files cited as the main bottleneck.

OpenAI's new biology benchmark grades models with expert rubrics, and even GPT-Rosalind passes only about one task in three.

Key specs

Tasks	750
Rubric criteria	19,020
Expert authors	173
Reviewers	453
GPT-Rosalind pass rate	36.1%
GPT-5.5 pass rate	25.7%
Gemini 3.1 Pro pass rate	23.6%

Quick facts

Maker	OpenAI
Tasks	750 expert-authored
Rubric criteria	19,020 (~25 per task)
Top score	GPT-Rosalind, 36.1% pass rate
Authors	173 PhD scientists
Coverage	7 workflows × 7 biology domains
Format	Free-response with rubric scoring

Benchmarks

LifeSciBench Task Pass Rate (≥70% rubric)

GPT-Rosalind	36.1%
GPT-5.5	25.7%
Gemini 3.1 Pro	23.6%
GPT-5.4	20.7%
Grok 4.3	13.0%

What is it?

LifeSciBench is OpenAI's new evaluation suite for AI in life-science research. It contains 750 free-response tasks across seven biology workflows and seven domains, written by 173 PhD-level scientists from biotech and pharma. Each task is graded by a detailed rubric instead of a single right answer.

How does it work?

LifeSciBench grades model output against 19,020 expert-written criteria (about 25 per task) and reports two scores: a normalized rubric score and a task pass rate, where a task passes when its rubric score reaches 70%. About 53% of tasks include attached artifacts (figures, PDFs, sequence files, chemical structures, tables) that models must read, and 79% require multi-step reasoning, averaging four decision steps.
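To make the two scores concrete, here is a minimal sketch of how a normalized rubric score and a pass rate at the 70% threshold could be computed. It is an illustration, not OpenAI's grader: the `Criterion` class, the per-criterion `points` weights, and both function names are hypothetical, and the sketch assumes every criterion carries positive points and is judged simply as met or not met.

```python
from dataclasses import dataclass
from typing import Sequence

PASS_THRESHOLD = 0.70  # from this page: a task passes at >= 70% rubric score


@dataclass
class Criterion:
    points: float  # hypothetical weight; the real weighting scheme is not described here
    met: bool      # whether the grader judged the criterion satisfied


def rubric_score(criteria: Sequence[Criterion]) -> float:
    """Normalized score: points earned divided by points available."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("a rubric needs positive total points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


def task_pass_rate(tasks: Sequence[Sequence[Criterion]]) -> float:
    """Fraction of tasks whose normalized score reaches the threshold."""
    if not tasks:
        raise ValueError("no tasks to score")
    passed = sum(1 for criteria in tasks if rubric_score(criteria) >= PASS_THRESHOLD)
    return passed / len(tasks)


# Two toy tasks with four equally weighted criteria each.
task_a = [Criterion(1.0, True), Criterion(1.0, True),
          Criterion(1.0, True), Criterion(1.0, False)]   # score 0.75, passes
task_b = [Criterion(1.0, True), Criterion(1.0, False),
          Criterion(1.0, False), Criterion(1.0, False)]  # score 0.25, fails

print(rubric_score(task_a), rubric_score(task_b), task_pass_rate([task_a, task_b]))
```

The sketch also shows why the two metrics can diverge: a model can earn partial credit on many tasks, giving a moderate average rubric score, while clearing the 70% bar on far fewer of them.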

Why does it matter?

LifeSciBench gives biotech teams a hard, realistic yardstick for choosing models to run scientific workflows. OpenAI's leaderboard shows that even the strongest model (GPT-Rosalind, 36.1% pass rate) is far from saturating the benchmark, and that models break down most on tasks with attached files rather than on text-only tasks.

Who is it for?

AI evaluation researchers, and biotech and pharma teams deploying LLMs

Try it

Read the LifeSciBench paper (PDF) on cdn.openai.com