Allen Institute for AI · 2026-04-07 · major

WildDet3D — Promptable monocular 3D detection in the wild

Rating: 4
Author: AI/TLDR

Ai2 ships an open-world 3D detector that takes one RGB image plus a text, box, or point prompt and returns 3D boxes. It is backed by a new 1M-image dataset with 3.9M human-verified annotations across 13K categories, and it posts zero-shot wins on Argoverse 2 and ScanNet.

WildDet3D — example 3D bounding boxes rendered in a real-world scene, from Ai2's announcement

Single-image 3D detection that takes text, box or point prompts and works zero-shot across indoor, street and nature scenes.

Key specs

Omni3D AP (text prompt)	34.2
Omni3D AP (box prompt)	36.4
Argoverse 2 ODS (zero-shot)	40.3
ScanNet ODS (zero-shot)	48.9
In-the-wild AP (700+ categories)	22.6
Training images	1,003,886
Verified 3D annotations	3,910,855
Categories covered	13,000+
Training epochs	12

What is it?

WildDet3D is an open-vocabulary monocular 3D object detector from the Allen Institute for AI. Given one RGB image, it returns 3D bounding boxes for the objects you ask for, specified by category name, 2D box, or point click, and it can optionally take a depth map when one is available. The release bundles the model family, a 1M-image training set (WildDet3D-Data), a Hugging Face Space demo, a Models collection, and an iOS app that runs the detector on-device.
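To make the input and output shapes concrete, here is a minimal Python sketch of the three prompt kinds and a 3D box result. Every name in it (`TextPrompt`, `BoxPrompt`, `PointPrompt`, `Box3D`, `box_corners`) is hypothetical and is not taken from the WildDet3D code. The box parameterization (center, size, yaw about the camera's vertical axis) is a common convention assumed here, not something the announcement specifies.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple, Union


# Hypothetical prompt types mirroring the three prompt kinds in the text.
@dataclass(frozen=True)
class TextPrompt:
    category: str  # e.g. "chair"


@dataclass(frozen=True)
class BoxPrompt:
    xyxy: Tuple[float, float, float, float]  # 2D box in pixels


@dataclass(frozen=True)
class PointPrompt:
    xy: Tuple[float, float]  # clicked pixel


Prompt = Union[TextPrompt, BoxPrompt, PointPrompt]


# Hypothetical output record; the real model's output format may differ.
@dataclass(frozen=True)
class Box3D:
    center: Tuple[float, float, float]  # camera coordinates (assumed metres)
    size: Tuple[float, float, float]    # extent along x, y, z before rotation
    yaw: float                          # rotation about the y axis, radians
    label: str
    score: float


def box_corners(box: Box3D) -> List[Tuple[float, float, float]]:
    """Return the 8 corners of a yaw-rotated 3D box in camera coordinates."""
    cx, cy, cz = box.center
    w, h, l = box.size
    cos_y, sin_y = math.cos(box.yaw), math.sin(box.yaw)
    corners = []
    for sx, sy, sz in product((-1, 1), repeat=3):
        # Corner in the box's local frame, then rotate about y and translate.
        x, y, z = sx * w / 2, sy * h / 2, sz * l / 2
        corners.append((cx + cos_y * x + sin_y * z,
                        cy + y,
                        cz - sin_y * x + cos_y * z))
    return corners


# Usage: a made-up detection for a text prompt.
prompt: Prompt = TextPrompt("chair")
detection = Box3D(center=(0.0, 0.0, 5.0), size=(2.0, 1.0, 4.0),
                  yaw=0.0, label=prompt.category, score=0.9)
print(len(box_corners(detection)))  # 8
```

Keeping the prompt as a small tagged union means one detection call can accept any of the three prompt kinds without separate entry points.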

How does it work?

The architecture is a unified geometry-aware transformer that consumes the image plus an optional depth signal and produces per-object 3D boxes conditioned on a prompt embedding. To build training data, the team generated candidate 3D boxes from existing 2D annotations across dozens of datasets, then kept only the human-verified ones. The result is 3.9M annotations across 13K categories, split roughly into indoor (52%), urban (32%) and nature (15%) scenes. The model converges in roughly 12 epochs, versus 80–120 for prior open-vocabulary 3D detectors.
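The generate-then-verify data step can be sketched as a simple filter. This is an illustration under stated assumptions, not Ai2's pipeline: the `CandidateBox` record, its field names, and the toy data are all hypothetical. The last helper just does the epoch arithmetic from the paragraph above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CandidateBox:
    """Hypothetical record for one auto-generated 3D box candidate."""
    image_id: str
    category: str
    scene: str            # assumed labels: "indoor", "urban", "nature"
    human_verified: bool  # True once an annotator accepted the box


def keep_verified(candidates: Iterable[CandidateBox]) -> List[CandidateBox]:
    """Keep only candidates that passed human verification."""
    return [c for c in candidates if c.human_verified]


def scene_shares(boxes: Iterable[CandidateBox]) -> Dict[str, float]:
    """Fraction of boxes per scene type; empty input gives an empty dict."""
    counts = Counter(b.scene for b in boxes)
    total = sum(counts.values())
    return {scene: n / total for scene, n in counts.items()} if total else {}


def epoch_ratio_range(new: int, old_low: int, old_high: int) -> Tuple[float, float]:
    """How many times fewer epochs `new` is than the old range.

    Note: fewer epochs does not by itself imply less wall-clock time.
    """
    return old_low / new, old_high / new


# Toy data, invented purely for illustration.
candidates = [
    CandidateBox("img0", "chair", "indoor", True),
    CandidateBox("img0", "lamp", "indoor", False),
    CandidateBox("img1", "car", "urban", True),
    CandidateBox("img2", "deer", "nature", True),
    CandidateBox("img3", "sofa", "indoor", True),
]
verified = keep_verified(candidates)
print(len(verified), scene_shares(verified))
print(epoch_ratio_range(12, 80, 120))  # 12 epochs vs the 80-120 quoted above
```

On the quoted figures, 12 epochs is roughly 6.7 to 10 times fewer than the 80–120 range.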

Why does it matter?

Before WildDet3D, open-vocabulary 3D detection from a single image was limited to a few dozen indoor categories. This release raises category coverage by more than two orders of magnitude and improves zero-shot ODS on Argoverse 2 from 23.8 to 40.3 and on ScanNet from 31.5 to 48.9. It also ships with a real-time iPhone app, which makes it the first open 3D detector most practitioners can actually point at the room they're sitting in.
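For a sense of scale, the zero-shot gains can be worked out directly from the before and after scores quoted above. The helper name `gain` is just for illustration, and the snippet uses only the four numbers in the text.

```python
from typing import Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), 1 decimal."""
    absolute = after - before
    relative_pct = 100 * absolute / before
    return round(absolute, 1), round(relative_pct, 1)


# Zero-shot ODS scores (before, after) as quoted in the text.
results = {"Argoverse 2": (23.8, 40.3), "ScanNet": (31.5, 48.9)}

for name, (before, after) in results.items():
    points, pct = gain(before, after)
    print(f"{name}: +{points} ODS points ({pct}% relative)")
# Argoverse 2: +16.5 ODS points (69.3% relative)
# ScanNet: +17.4 ODS points (55.2% relative)
```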

Who is it for?

Computer vision researchers, robotics and AR/VR developers, 3D perception hobbyists.

Try it

huggingface.co/spaces/allenai/WildDet3D