AI/TLDR

OpenAI · 2026-06-17 · major

LifeSciBench — OpenAI's 750-task benchmark for life-science research

OpenAI's LifeSciBench grades AI models on 750 expert-authored life-science tasks using rubrics. The strongest model, GPT-Rosalind, passes only 36.1%, with attached data files cited as the main bottleneck.


OpenAI's new biology benchmark grades models with expert rubrics, and even GPT-Rosalind passes only about one task in three.

Key specs

Tasks: 750
Rubric criteria: 19,020
Expert authors: 173
Reviewers: 453
GPT-Rosalind pass rate: 36.1%
GPT-5.5 pass rate: 25.7%
Gemini 3.1 Pro pass rate: 23.6%

Quick facts

Maker: OpenAI
Tasks: 750 expert-authored
Rubric criteria: 19,020 (~25 per task)
Top score: GPT-Rosalind, 36.1% pass rate
Authors: 173 PhD scientists
Coverage: 7 workflows × 7 biology domains
Format: Free-response with rubric scoring

Benchmarks

LifeSciBench Task Pass Rate (≥70% rubric)
GPT-Rosalind: 36.1%
GPT-5.5: 25.7%
Gemini 3.1 Pro: 23.6%
GPT-5.4: 20.7%
Grok 4.3: 13.0%

What is it?

LifeSciBench is OpenAI's new evaluation suite for AI in life-science research. It contains 750 free-response tasks across seven biology workflows and seven domains, written by 173 PhD-level scientists from biotech and pharma. Each task is graded by a detailed rubric instead of a single right answer.

How does it work?

LifeSciBench grades model output against 19,020 expert-written criteria (about 25 per task) and reports two scores: a normalized rubric score and a task pass rate, where a task passes when it earns at least 70% of its rubric. About 53% of LifeSciBench tasks include attached artifacts (figures, PDFs, sequence files, chemical structures, tables) that models must read, and 79% need multi-step reasoning, averaging four decision steps.
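The two scores can be illustrated with a minimal Python sketch. It is not OpenAI's grading code: the `Criterion` class, the `points` weights, and the helper names are hypothetical, and the sketch assumes each criterion is simply met or not met and that a task's normalized score is earned points divided by total points. Only the 70% pass threshold comes from the benchmark description.

```python
from dataclasses import dataclass

# From the benchmark description: a task passes at >= 70% of its rubric.
PASS_THRESHOLD = 0.70


@dataclass
class Criterion:
    """One rubric line. `points` is a hypothetical weight (an assumption)."""
    description: str
    points: float
    met: bool


def rubric_score(criteria):
    """Normalized rubric score: earned points / total points, in [0, 1]."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    earned = sum(c.points for c in criteria if c.met)
    return earned / total


def summarize(tasks):
    """Aggregate per-task rubrics into the two headline numbers."""
    scores = [rubric_score(task) for task in tasks]
    passed = sum(score >= PASS_THRESHOLD for score in scores)
    return {
        "mean_rubric_score": sum(scores) / len(scores),
        "pass_rate": passed / len(scores),
    }


if __name__ == "__main__":
    # Two toy tasks with four equally weighted criteria each.
    task_a = [Criterion(f"a{i}", 1.0, met) for i, met in enumerate([True, True, True, False])]
    task_b = [Criterion(f"b{i}", 1.0, met) for i, met in enumerate([True, True, False, False])]
    print(summarize([task_a, task_b]))
```

Under these assumptions a model can earn a fairly high mean rubric score while passing few tasks, because partial credit below the 70% line counts toward the first number but not the second.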

Why does it matter?

LifeSciBench gives biotech teams a hard, realistic yardstick for choosing models to run scientific workflows. OpenAI's leaderboard shows that even the strongest model (GPT-Rosalind, 36.1% pass rate) is far from saturating the benchmark, and that file handling, not text reasoning, is where current models break down.

Who is it for?

AI evaluation researchers, and biotech and pharma teams deploying LLMs.

Frequently asked questions

What is LifeSciBench?
LifeSciBench is OpenAI's 750-task evaluation that grades how well AI models handle real life-science research. Each task uses an expert-written rubric instead of multiple choice, and the benchmark spans seven biological domains and seven research workflows.
How well do current models do on LifeSciBench?
The strongest model on LifeSciBench is GPT-Rosalind, with a 36.1% task pass rate (rubric score 0.576). GPT-5.5 reaches 25.7%, Gemini 3.1 Pro 23.6%, GPT-5.4 20.7%, and Grok 4.3 13.0%. OpenAI says the benchmark is far from saturated.
Why is LifeSciBench harder than older biology benchmarks?
LifeSciBench uses free-response tasks graded by 19,020 rubric criteria written by 173 PhD scientists, not multiple-choice answers. About 53% of tasks require models to interpret attached artifacts like figures, PDFs, sequence files, and chemical structures.
What's the biggest weakness LifeSciBench exposed?
Artifact processing is the main bottleneck. GPT-Rosalind's pass rate drops 17 points from 45.1% on text-only tasks to 28.1% on tasks with attached data files, showing that current models struggle with scientific figures and structured biology files.
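As a quick consistency check on these figures, the short Python sketch below recomputes the 17-point gap and blends the two subgroup pass rates using the roughly 53% share of tasks with attachments. The function and variable names are illustrative, and the 53% share is a rounded figure, so the result is only expected to land close to GPT-Rosalind's overall 36.1%.

```python
# Figures reported for GPT-Rosalind in the text above.
TEXT_ONLY_PASS = 45.1   # % pass rate on text-only tasks
ARTIFACT_PASS = 28.1    # % pass rate on tasks with attached files
ARTIFACT_SHARE = 0.53   # approximate share of tasks with attachments


def blended_pass_rate(text_only, with_artifacts, artifact_share):
    """Overall pass rate as a task-share-weighted average of the two subgroups."""
    return (1 - artifact_share) * text_only + artifact_share * with_artifacts


gap = TEXT_ONLY_PASS - ARTIFACT_PASS
blended = blended_pass_rate(TEXT_ONLY_PASS, ARTIFACT_PASS, ARTIFACT_SHARE)

print(f"gap: {gap:.1f} points")        # 45.1 - 28.1
print(f"blended: {blended:.1f}%")      # 0.47 * 45.1 + 0.53 * 28.1
```

The blend works out to 0.47 × 45.1 + 0.53 × 28.1 ≈ 36.1, which matches the headline pass rate and suggests the subgroup numbers and the overall score are mutually consistent.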

Try it

Read the LifeSciBench paper (PDF) on cdn.openai.com


Tags

  • openai
  • benchmark
  • evaluation
  • biology
  • life-sciences
  • gpt-rosalind
  • rubric-grading
  • scientific-research
