AI/TLDR

Allen Institute for AI · 2026-06-17 · major

MolmoMotion — Ai2's language-guided 3D motion forecasting models

Ai2 released MolmoMotion, two open models that predict where points on objects will move in 3D space from a video frame plus a text instruction. The drop bundles a 1.16M-video training set and the PointMotionBench eval, and Ai2 reports that it lifts robot pick-and-place success from a 56.0% baseline to 76.3%.

Figure: MolmoMotion announcement banner showing 3D point trajectories overlaid on a video frame.

MolmoMotion predicts how points on objects move in 3D from a video frame and a text instruction, with weights, a 1.16M-video dataset, and a benchmark.

Quick facts

Maker: Allen Institute for AI (Ai2)
Variants: MolmoMotion-AR (autoregressive), MolmoMotion-FM (flow-matching)
Input: Video frame + marked points + text instruction
Output: 3D point trajectories over the next few seconds
Training set: MolmoMotion-1M (1.16M videos, 736 motion types, 5.6K objects)
Benchmark: PointMotionBench (2.7K human-validated clips)
Robot result: 76.3% pick-and-place success vs 56.0% baseline

What is it?

MolmoMotion is a pair of open motion-forecasting models from Ai2. Given a single video frame, a few marked points on objects, and a text instruction such as 'pour the water into the cup,' it predicts where those points travel through 3D space over the next few seconds.

How does it work?

MolmoMotion ships in two variants. MolmoMotion-AR is autoregressive: it writes future 3D coordinates as structured text, step by step, on top of an Ai2 vision-language backbone. MolmoMotion-FM uses flow matching: it learns to transform random noise into a full trajectory in continuous 3D space, which Ai2 says represents uncertainty better when multiple futures are plausible. Both are trained on MolmoMotion-1M, a new 1.16M-video corpus in which every clip carries action-described, object-grounded 3D point trajectories, spanning 736 motion types across 5.6K distinct objects.
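
To make "3D coordinates as structured text" concrete, here is a minimal sketch of a round trip between 3D points and text, one line per time step. The line format (`t=<step>: x,y,z; x,y,z`) and the helper names `encode_step`, `decode_step`, and `decode_rollout` are invented for illustration; the announcement does not specify the text format MolmoMotion-AR actually emits.

```python
def encode_step(step, points):
    """Serialize one time step as 't=<step>: x,y,z; x,y,z; ...' (hypothetical format)."""
    coords = "; ".join(f"{x:.3f},{y:.3f},{z:.3f}" for x, y, z in points)
    return f"t={step}: {coords}"


def decode_step(line):
    """Parse one line back into (step, [(x, y, z), ...])."""
    header, body = line.split(":", 1)
    step = int(header.split("=")[1])
    points = [tuple(float(v) for v in chunk.split(",")) for chunk in body.split(";")]
    return step, points


def decode_rollout(text):
    """Turn a multi-line rollout into one trajectory (list of positions) per tracked point."""
    steps = sorted(decode_step(line) for line in text.strip().splitlines())
    tracks = [[] for _ in steps[0][1]]
    for _, points in steps:
        for track, point in zip(tracks, points):
            track.append(point)
    return tracks


if __name__ == "__main__":
    # Two tracked points over three time steps, written step by step as text.
    rollout = "\n".join(
        encode_step(i, [(0.1 * i, 0.0, 0.5), (1.0, 0.2 * i, 0.0)]) for i in range(3)
    )
    print(rollout)
    print(decode_rollout(rollout))
```

An autoregressive model would generate such lines one after another, each conditioned on the lines before it; the sketch only covers the text representation, not the model.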

Why does it matter?

Motion forecasting bridges perception and action for both robots and video generators. According to Ai2, MolmoMotion outperforms every existing 3D motion forecaster on the new PointMotionBench eval and lifts robot pick-and-place success from 56.0% to 76.3%. Ai2 also released MolmoMotion-1M and PointMotionBench, so other labs can train and compare without rebuilding the corpus.
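
For a sense of scale, the two reported success rates can be compared directly. The sketch below assumes both percentages were measured on the same set of tasks (the summary does not say otherwise); the `improvement` helper is a name made up for this example.

```python
def improvement(baseline_pct, new_pct):
    """Compare two success rates given in percent (assumed to share one task set)."""
    absolute = new_pct - baseline_pct                 # percentage points gained
    relative = absolute / baseline_pct * 100.0        # gain relative to the baseline
    failures_before = 100.0 - baseline_pct
    failures_after = 100.0 - new_pct
    failure_cut = (failures_before - failures_after) / failures_before * 100.0
    return {
        "absolute_points": absolute,
        "relative_percent": relative,
        "failure_reduction_percent": failure_cut,
    }


if __name__ == "__main__":
    for name, value in improvement(56.0, 76.3).items():
        print(f"{name}: {value:.1f}")
```

Going from 56.0% to 76.3% is a gain of 20.3 percentage points, roughly a 36% relative improvement in successes and roughly 46% fewer failed attempts.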

Who is it for?

Robotics researchers, video-generation researchers, multimodal ML practitioners

Frequently asked questions

What is MolmoMotion?
MolmoMotion is a pair of open models from the Allen Institute for AI, released on June 17, 2026, that predict 3D point trajectories on objects from a video frame, a few marked points, and a natural-language instruction. MolmoMotion ships in two variants — MolmoMotion-AR (autoregressive) and MolmoMotion-FM (flow-matching) — together with the 1.16M-video MolmoMotion-1M training set and the PointMotionBench evaluation.
How well does MolmoMotion perform on robots?
MolmoMotion completes 76.3% of pick-and-place tasks in robotics evaluations, versus 56.0% for the prior baseline reported in the announcement. On the dedicated PointMotionBench eval, MolmoMotion outperforms all existing 3D motion-forecasting methods, according to the Ai2 release.
What is the difference between MolmoMotion-AR and MolmoMotion-FM?
MolmoMotion-AR predicts future coordinates step by step and represents 3D coordinates as structured text. MolmoMotion-FM predicts trajectories in continuous 3D space by transforming noise into motion through flow matching, which Ai2 says is better suited for representing uncertainty when multiple plausible futures exist.
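
The "noise into motion" idea behind flow matching can be sketched on a toy problem. This is not Ai2's model: `toy_velocity` is a closed-form stand-in for the learned network, valid only when there is a single known target trajectory, and all names here are hypothetical. The sketch starts from Gaussian noise and takes Euler steps along the velocity field from t = 0 to t = 1.

```python
import random


def toy_velocity(x, target, t):
    """Stand-in for a learned velocity network.

    With one fixed target and straight-line paths from noise to target,
    the velocity at state x and time t is (target - x) / (1 - t).
    """
    scale = 1.0 / (1.0 - t)
    return [tuple((b - a) * scale for a, b in zip(p, q)) for p, q in zip(x, target)]


def sample_trajectory(target, steps=20, seed=0):
    """Integrate from Gaussian noise toward a trajectory with Euler steps."""
    rng = random.Random(seed)
    # One noisy 3D position per time step of the trajectory.
    x = [tuple(rng.gauss(0.0, 1.0) for _ in range(3)) for _ in target]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # stays below 1, so toy_velocity never divides by zero
        v = toy_velocity(x, target, t)
        x = [tuple(a + dt * b for a, b in zip(p, q)) for p, q in zip(x, v)]
    return x


if __name__ == "__main__":
    # A short target path for one tracked point: three (x, y, z) positions.
    target = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05), (0.2, 0.1, 0.1)]
    print(sample_trajectory(target))
```

In this toy every noise sample ends on the same trajectory. A trained model replaces `toy_velocity` with a network conditioned on the frame, points, and instruction, so different noise samples can land on different plausible futures.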

Try it

https://huggingface.co/collections/allenai/molmomotion

Tags

  • ai2
  • allen-institute
  • molmomotion
  • motion-forecasting
  • 3d
  • robotics
  • video
  • open-weights
  • huggingface
  • dataset
  • benchmark
