NVIDIA · 2026-04-14 · notable

RoboLab — NVIDIA's 120-Task Sim Benchmark Shows SOTA Robot Policies Top Out at 25.8%

NVIDIA's simulation benchmark for robot manipulation policies covers 120 tasks across visual, procedural, and relational competencies. The best current model (π0.5) reaches 25.8% success with specific instructions and drops to 16.8% with vague ones. #1 on Hugging Face Papers today.

NVIDIA's new 120-task simulation benchmark reveals that the best open robot policies succeed less than 26% of the time on manipulation tasks.

Key specs

Tasks	120
Best model success rate (specific instructions)	25.8% (π0.5)
Best model success rate (vague instructions)	16.8% (π0.5)
Avg subtasks per task	2.02
Hugging Face upvotes	72 (#1 today)

What is it?

RoboLab is a simulation benchmarking platform from NVIDIA Research for evaluating task-generalist robot manipulation policies. It ships with RoboLab-120 — 120 tasks spanning pick-and-place, stacking, rearrangement, and tool use, organized across three competency axes: visual (how the scene looks), procedural (step ordering), and relational (object relationships). It is built on NVIDIA Isaac Sim 5.0 and supports LLM-assisted scene generation via Claude Code skills.

How does it work?

The benchmark uses a client-server architecture: the policy model runs independently and connects to the Isaac Sim environment through a lightweight inference client. Each task can be given with either a specific or a vague natural-language instruction, which reveals how much a policy relies on precise command phrasing. Tasks are evaluated across three difficulty levels, and the object sets have only 68.7% vocabulary overlap with the DROID training dataset, a deliberate choice that probes generalization beyond familiar objects.
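To make the split between policy and simulator concrete, here is a minimal sketch of that kind of loop. It is not RoboLab's actual API: the names `serve_policy`, `InferenceClient`, and `dummy_policy`, the JSON-lines wire format, and the observation fields are all assumptions made for illustration, and a local socket pair stands in for the real network connection.

```python
import json
import socket
import threading


def serve_policy(conn, policy_fn):
    """Hypothetical policy server: one JSON request per line in, one JSON action per line out."""
    with conn, conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        for line in reader:  # the loop ends when the client closes its side
            request = json.loads(line)
            action = policy_fn(request["observation"], request["instruction"])
            writer.write(json.dumps({"action": action}) + "\n")
            writer.flush()


class InferenceClient:
    """Hypothetical lightweight client living on the simulator side."""

    def __init__(self, conn):
        self._conn = conn
        self._reader = conn.makefile("r", encoding="utf-8")
        self._writer = conn.makefile("w", encoding="utf-8")

    def act(self, observation, instruction):
        # Send the current observation and instruction, then block until the action arrives.
        message = {"observation": observation, "instruction": instruction}
        self._writer.write(json.dumps(message) + "\n")
        self._writer.flush()
        return json.loads(self._reader.readline())["action"]

    def close(self):
        self._writer.close()
        self._reader.close()
        self._conn.close()


def dummy_policy(observation, instruction):
    # Placeholder policy (assumption): a zero command for every joint.
    return [0.0] * len(observation["joint_positions"])


def run_demo(steps=3):
    # socketpair() stands in for a real network link between simulator and policy.
    sim_side, policy_side = socket.socketpair()
    server = threading.Thread(
        target=serve_policy, args=(policy_side, dummy_policy), daemon=True
    )
    server.start()

    client = InferenceClient(sim_side)
    observation = {"joint_positions": [0.1, 0.2, 0.3]}  # made-up observation
    actions = [client.act(observation, "put the mug on the shelf") for _ in range(steps)]
    client.close()
    server.join(timeout=2)
    return actions
```

The point of the design is that the simulator only ever sees `act(observation, instruction)`; any policy that can answer that request over the wire can be evaluated without being imported into the simulator process.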

Why does it matter?

The main finding is a sobering reality check: π0.5, currently one of the strongest open vision-language-action (VLA) models, achieves only 25.8% success on specific instructions and falls to 16.8% when instructions are vague. The scene and the goal stay the same; only the wording changes, and the policy breaks. This quantifies a known qualitative weakness in current VLAs and gives robotics researchers a reproducible testbed to track progress. The CC-BY-NC-4.0 license permits non-commercial use, which covers academic research.
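For a sense of scale, the gap between the two reported numbers can be expressed both in percentage points and as a share of the specific-instruction score. The helper name `instruction_gap` below is made up for this illustration; only the two success rates come from the reported results.

```python
def instruction_gap(specific_pct, vague_pct):
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = specific_pct - vague_pct
    relative = absolute / specific_pct
    return absolute, relative


absolute, relative = instruction_gap(25.8, 16.8)
print(f"absolute drop: {absolute:.1f} percentage points")  # absolute drop: 9.0 percentage points
print(f"relative drop: {relative:.1%}")                    # relative drop: 34.9%
```

In other words, switching to vague wording costs π0.5 roughly a third of the successes it had with specific instructions.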

Who is it for?

Robotics researchers and teams building or evaluating VLA (vision-language-action) models.

Try it

git clone https://github.com/NVLabs/RoboLab  # requires Isaac Sim 5.0