Zhejiang University / StepFun · 2026-04-15 · notable

SpatialEvo — Self-Evolving 3D Spatial Reasoning with Deterministic Geometric Environments

ZJU and StepFun release SpatialEvo, a self-evolving VLM framework for 3D spatial reasoning that uses point clouds and camera poses as deterministic reward oracles, so no manual annotation is needed. Apache 2.0 weights at 3B and 7B are on Hugging Face, and training spans 16 task categories.

3D spatial reasoning that self-improves using geometry as its own reward signal — no annotations required.

What is it?

SpatialEvo is a self-evolving framework for teaching vision-language models to reason about 3D space. Instead of relying on human labels or model-consensus pseudo-labels, it derives ground truth directly from point clouds and camera poses — physical geometry that is deterministically computable. A single shared-parameter model alternates between generating spatial questions and solving them, improving across 16 task categories including depth estimation, object counting, orientation, and relative positioning. Apache 2.0 weights are released for 3B and 7B scales based on Qwen2.5-VL.
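
To make "deterministically computable" concrete, here is a minimal sketch of how a ground-truth answer could be derived from scene geometry alone. It is not SpatialEvo's code: the function names, the toy scene, and the centroid-depth rule are all illustrative assumptions, and the camera pose is assumed to be given as a world-to-camera rotation and translation.

```python
# Illustrative sketch only; names and the centroid-depth rule are assumptions,
# not part of the SpatialEvo codebase.

def world_to_camera(point, rotation, translation):
    """Apply a rigid world-to-camera transform: p_cam = R * p_world + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def centroid(points):
    """Mean of a list of 3D points (an object's segment of the point cloud)."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def object_depth(points, rotation, translation):
    """Depth of the object's centroid along the camera z-axis."""
    return world_to_camera(centroid(points), rotation, translation)[2]

def closer_object(objects, rotation, translation):
    """Ground-truth answer to 'which object is closer to the camera?'."""
    return min(objects, key=lambda name: object_depth(objects[name], rotation, translation))

def geometric_reward(answer, objects, rotation, translation):
    """Binary reward: 1.0 if the answer matches the geometric ground truth."""
    return 1.0 if answer == closer_object(objects, rotation, translation) else 0.0

# Toy scene: two objects, camera at the origin looking down +z.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ORIGIN = (0.0, 0.0, 0.0)
scene = {
    "chair": [(0.0, 0.0, 2.0), (0.2, 0.0, 2.2)],
    "table": [(1.0, 0.0, 3.5), (1.2, 0.0, 3.7)],
}

print(closer_object(scene, IDENTITY, ORIGIN))              # chair
print(geometric_reward("table", scene, IDENTITY, ORIGIN))  # 0.0
```

Because the answer is a pure function of the point cloud and the pose, the same question always receives the same label, with no human or model in the loop.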

How does it work?

The Deterministic Geometric Environment (DGE) wraps unannotated 3D scene assets (ScanNet, ScanNet++, ARKitScenes) in an online reward oracle that evaluates any spatial question-answer pair against exact geometric ground truth. The model trains under GRPO (Group Relative Policy Optimization), alternating questioner and solver roles within a single shared-parameter policy. A task scheduler tracks per-category accuracy and concentrates training on the weakest categories, producing an adaptive curriculum without manual design.
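
The two training-side pieces described above can be sketched in a few lines. This is a hedged illustration, not the released implementation: the error-proportional sampling rule, the `floor` parameter, and the use of population standard deviation are assumptions chosen for simplicity. The advantage function follows the usual GRPO idea of scoring each sampled answer relative to its own group.

```python
# Illustrative sketch only; the weighting rule and parameter values are assumptions.
import random
import statistics

def category_weights(accuracy, floor=0.05):
    """Sampling weights proportional to error rate (1 - accuracy).

    The floor keeps near-solved categories from being dropped entirely.
    """
    raw = {cat: max(1.0 - acc, floor) for cat, acc in accuracy.items()}
    total = sum(raw.values())
    return {cat: w / total for cat, w in raw.items()}

def sample_category(accuracy, rng):
    """Draw the next training category, favouring the weakest ones."""
    weights = category_weights(accuracy)
    cats = list(weights)
    return rng.choices(cats, weights=[weights[c] for c in cats], k=1)[0]

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward normalized by the group mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical running accuracies for three of the task categories.
running_accuracy = {"depth": 0.9, "counting": 0.5, "orientation": 0.6}
weights = category_weights(running_accuracy)
next_category = sample_category(running_accuracy, random.Random(0))

# Oracle rewards for four sampled answers to one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With these toy numbers, "counting" receives the largest sampling weight because it has the lowest accuracy, and correct answers get positive advantages while incorrect ones get negative advantages. If every answer in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.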

Why does it matter?

Self-evolving training has a known failure mode in spatial tasks: models learn to agree with themselves rather than with physical reality, so errors accumulate. By anchoring rewards in computable geometry rather than model consensus, SpatialEvo sidesteps this. Consistent improvements on 9 benchmarks at two scales without degrading general multimodal performance suggest the approach generalizes. Open weights and code make it immediately applicable to embodied AI and robotics research.

Who is it for?

Researchers and engineers working on embodied AI, robotics perception, or spatial reasoning in VLMs.

Try it

git clone https://github.com/ZJU-REAL/SpatialEvo

Key numbers

Model sizes: 3B and 7B
Spatial task categories: 16
3B avg (9 benchmarks): 51.1%
7B avg (9 benchmarks): 54.7%
License: Apache 2.0