NVIDIA Research · 2026-06-12 · notable
NVIDIA SpatialClaw — code as the action interface for spatial reasoning agents
NVIDIA paper introduces SpatialClaw, an agent framework that writes Python in a stateful kernel pre-loaded with perception tools. Average 59.9% across 20 spatial benchmarks, an 11.2-point gain over prior SOTA, on six VLM backbones with no per-model tuning.

NVIDIA framework that lets a spatial-reasoning agent write Python each turn instead of picking from a fixed tool menu.
What is it
SpatialClaw is a paper and open code release from NVIDIA Research for vision-language-action agents that answer spatial questions from images and video — things like 'is the cup behind the laptop?' or 'how far is the chair from the door?'. Instead of choosing from a fixed tool menu or committing to a full plan upfront, the agent writes Python code each turn and observes the result.
How it works
The agent runs inside a persistent Python kernel that is pre-loaded with the input frames and a set of perception primitives — Depth Anything 3, SAM 3, NumPy, SciPy. Each step is a five-stage loop: plan, generate code, execute, read back the result (a depth map, a mask, a number), and decide whether to refine or submit an answer. Because the kernel keeps state, later steps can reuse earlier computations.
Why it matters
Fixed tool menus and one-shot reasoning both struggle on spatial tasks that need chained geometric computation across frames. Treating code as the action interface lifts average accuracy by 11.2 points across 20 spatial benchmarks, with the biggest gains on video and multi-view tasks. The same setup works on six VLM backbones from two model families without per-model tuning, so other teams can drop it in.
Who it's for
VLM researchers and robotics engineers working on spatial reasoning
Try it
git clone https://github.com/NVlabs/SpatialClawKey numbers
- Avg accuracy (20 benchmarks): 59.9%
- Gain over prior SOTA: +11.2 pp
- VLM backbones tested: 6
- Code stars (day 2): 70