NVIDIA Research · 2026-06-12 · notable

NVIDIA SpatialClaw — code as the action interface for spatial reasoning agents

NVIDIA paper introduces SpatialClaw, an agent framework that writes Python in a stateful kernel pre-loaded with perception tools. Average 59.9% across 20 spatial benchmarks, an 11.2-point gain over prior SOTA, on six VLM backbones with no per-model tuning.

SpatialClaw teaser figure showing a vision-language agent reasoning over a scene by writing Python code that calls depth, segmentation, and geometry tools — NVIDIA Research / SpatialClaw project page

NVIDIA framework that lets a spatial-reasoning agent write Python each turn instead of picking from a fixed tool menu.

Key specs

Avg accuracy (20 benchmarks)	59.9%
Gain over prior sota	+11.2 pp
Vlm backbones tested	6
Code stars (day 2)	70

Benchmarks

Avg accuracy · 20 spatial benchmarks

SpatialClaw		59.9%

source ↗

What is it?

SpatialClaw is a paper and open code release from NVIDIA Research for vision-language-action agents that answer spatial questions from images and video — things like 'is the cup behind the laptop?' or 'how far is the chair from the door?'. Instead of choosing from a fixed tool menu or committing to a full plan upfront, the agent writes Python code each turn and observes the result.

How does it work?

The agent runs inside a persistent Python kernel that is pre-loaded with the input frames and a set of perception primitives — Depth Anything 3, SAM 3, NumPy, SciPy. Each step is a five-stage loop: plan, generate code, execute, read back the result (a depth map, a mask, a number), and decide whether to refine or submit an answer. Because the kernel keeps state, later steps can reuse earlier computations.

Why does it matter?

Fixed tool menus and one-shot reasoning both struggle on spatial tasks that need chained geometric computation across frames. Treating code as the action interface lifts average accuracy by 11.2 points across 20 spatial benchmarks, with the biggest gains on video and multi-view tasks. The same setup works on six VLM backbones from two model families without per-model tuning, so other teams can drop it in.

Who is it for?

VLM researchers and robotics engineers working on spatial reasoning

Try it

git clone https://github.com/NVlabs/SpatialClaw