OpenAI · 2026-06-17 · major
LifeSciBench — OpenAI's 750-task benchmark for life-science research
OpenAI's LifeSciBench grades AI models on 750 expert-authored life-science tasks using rubrics. The strongest model, GPT-Rosalind, passes only 36.1%, with attached data files cited as the main bottleneck.

OpenAI's new biology benchmark grades models with expert rubrics, and even GPT-Rosalind passes barely more than one task in three.
Key specs
| Metric | Value |
|---|---|
| Tasks | 750 |
| Rubric criteria | 19,020 |
| Expert authors | 173 |
| Reviewers | 453 |
| GPT-Rosalind pass rate | 36.1% |
| GPT-5.5 pass rate | 25.7% |
| Gemini 3.1 Pro pass rate | 23.6% |
Quick facts
| Field | Detail |
|---|---|
| Maker | OpenAI |
| Tasks | 750 expert-authored |
| Rubric criteria | 19,020 (~25 per task) |
| Top score | GPT-Rosalind, 36.1% pass rate |
| Authors | 173 PhD scientists |
| Coverage | 7 workflows × 7 biology domains |
| Format | Free-response with rubric scoring |
Benchmarks
| Model | Task pass rate |
|---|---|
| GPT-Rosalind | 36.1% |
| GPT-5.5 | 25.7% |
| Gemini 3.1 Pro | 23.6% |
| GPT-5.4 | 20.7% |
| Grok 4.3 | 13.0% |
What is it?
LifeSciBench is OpenAI's new evaluation suite for AI in life-science research. It contains 750 free-response tasks across seven biology workflows and seven domains, written by 173 PhD-level scientists from biotech and pharma. Each task is graded by a detailed rubric instead of a single right answer.
How does it work?
LifeSciBench grades model output against 19,020 expert-written criteria (about 25 per task) and tracks two scores: a normalized rubric score and a task pass rate at the 70% threshold. About 53% of LifeSciBench tasks include attached artifacts (figures, PDFs, sequence files, chemical structures, tables) that models must read, and 79% require multi-step reasoning, averaging four decision steps.
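The two scores can be illustrated with a minimal sketch. It assumes that a task's normalized rubric score is earned points divided by available points, and that a task passes when that score is at least 0.70. The source gives only the 70% threshold, so the exact aggregation formula, the per-criterion weights, and every name here (`Criterion`, `rubric_score`, `pass_rate`) are hypothetical, not OpenAI's grader.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.70  # the 70% threshold named in the text


@dataclass
class Criterion:
    points: float  # hypothetical weight of this rubric criterion
    met: bool      # grader's verdict for the model's answer


def rubric_score(criteria):
    """Normalized score: earned points / available points (assumed formula)."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must have positive total points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


def pass_rate(tasks, threshold=PASS_THRESHOLD):
    """Fraction of tasks whose normalized score reaches the threshold."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for t in tasks if rubric_score(t) >= threshold)
    return passed / len(tasks)


# Two made-up tasks: the first scores 3/4, the second 2/4.
tasks = [
    [Criterion(1, True), Criterion(1, True), Criterion(1, True), Criterion(1, False)],
    [Criterion(2, True), Criterion(1, False), Criterion(1, False)],
]
mean_score = sum(rubric_score(t) for t in tasks) / len(tasks)

print(f"mean rubric score: {mean_score:.3f}")   # 0.625
print(f"task pass rate: {pass_rate(tasks):.0%}")  # 50%
```

The toy data shows why the two numbers diverge: a model can collect more than half of the rubric points on average while still clearing the threshold on only a minority of tasks, which matches the reported pattern of a 0.576 rubric score alongside a 36.1% pass rate.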
Why does it matter?
LifeSciBench gives biotech teams a hard, realistic yardstick for picking models to run scientific workflows. OpenAI's leaderboard shows even the strongest model (GPT-Rosalind, 36.1% pass rate) is far from saturating the benchmark, and that file-handling — not text reasoning — is where current models break down.
Who is it for?
AI evaluation researchers, and biotech and pharma teams deploying LLMs.
Frequently asked questions
- What is LifeSciBench?
- LifeSciBench is OpenAI's 750-task evaluation that grades how well AI models handle real life-science research. Each task uses an expert-written rubric instead of multiple choice, and the benchmark spans seven biological domains and seven research workflows.
- How well do current models do on LifeSciBench?
- The strongest model on LifeSciBench is GPT-Rosalind, with a 36.1% task pass rate (rubric score 0.576). GPT-5.5 reaches 25.7%, Gemini 3.1 Pro 23.6%, GPT-5.4 20.7%, and Grok 4.3 13.0%. OpenAI says the benchmark is far from saturated.
- Why is LifeSciBench harder than older biology benchmarks?
- LifeSciBench uses free-response tasks graded by 19,020 rubric criteria written by 173 PhD scientists, not multiple-choice answers. About 53% of tasks require models to interpret attached artifacts like figures, PDFs, sequence files, and chemical structures.
- What's the biggest weakness LifeSciBench exposed?
- Artifact processing is the main bottleneck. GPT-Rosalind's pass rate drops 17 points from 45.1% on text-only tasks to 28.1% on tasks with attached data files, showing that current models struggle with scientific figures and structured biology files.
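The artifact gap quoted above can be checked with a few lines of arithmetic. The sketch below uses only figures from this page. The blended estimate treats the "about 53%" artifact share as exact and assumes the text-only and artifact splits together cover all 750 tasks, so it is a rough consistency check rather than a reported result; the variable names are illustrative.

```python
# GPT-Rosalind pass rates quoted in the text (percent).
text_only = 45.1
with_artifacts = 28.1
artifact_share = 0.53  # "about 53%" of tasks, treated as exact here (assumption)

# Absolute gap in percentage points, and the drop relative to text-only tasks.
gap_points = text_only - with_artifacts
relative_drop = gap_points / text_only

# Weighted average of the two splits, to compare with the overall pass rate.
blended = (1 - artifact_share) * text_only + artifact_share * with_artifacts

print(f"gap: {gap_points:.1f} points")          # 17.0 points
print(f"relative drop: {relative_drop:.1%}")    # 37.7%
print(f"blended pass rate: {blended:.1f}%")     # 36.1%
```

Under those assumptions the weighted average lands on the 36.1% headline figure, which suggests the overall score is pulled down mostly by the artifact-bearing half of the benchmark.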
Try it
Read the LifeSciBench paper (PDF) on cdn.openai.com