Allen Institute for AI · 2026-05-05 · major
MolmoAct 2 — Ai2's Fully Open Bimanual Robotics Model With 720h Open Dataset
Ai2 released MolmoAct 2, a fully open Vision-Language-Action model for real-world robots. It outperforms π₀.₅ on seven benchmarks, hits 87.1% on real-world Franka tasks, runs up to 37x faster than its predecessor, and ships with the largest open bimanual robot dataset to date.

An open VLA model from Ai2 that runs two-armed robots in the real world — weights, code and 720h of bimanual data on day one.
Key specs
| Spec | Value |
|---|---|
| License | Apache 2.0 |
| LIBERO avg | 97.2% |
| Real-world YAM | 50.1% |
| Real-world Franka | 87.1% |
| Speedup vs. v1 | up to 37x |
| MolmoAct 2-Bimanual YAM dataset hours | 720+ |
| Embodied reasoning avg | 63.8% |
What is it?
MolmoAct 2 is an action reasoning model from the Allen Institute for AI that takes camera frames and a natural-language instruction and outputs continuous robot actions. It targets practical deployment on platforms such as the bimanual YAM, the low-cost SO-100/101 arms, and the Franka. Ai2 also released the MolmoAct 2-Bimanual YAM dataset (720+ hours and 146,000 annotations across 28 tasks), which it describes as the largest open-source bimanual manipulation dataset published to date.
How does it work?
The model is built on Molmo 2-ER, a new embodied-reasoning Molmo variant trained on 3.3M samples of pointing, detection and abstract spatial reasoning. A flow-matching continuous-action expert is grafted onto the discrete-token VLM via per-layer KV-cache conditioning, and a 'specialize-then-rehearse' training recipe preserves general VLM skills. A separate OpenFAST tokenizer is trained across five embodiments. To cut latency, the MolmoAct 2-Think variant re-predicts depth tokens only for the parts of the scene that have changed.
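For readers new to flow matching, here is a minimal sketch of how a continuous action can be sampled by integrating a velocity field from noise to an action. It is an illustration only, not Ai2's implementation: `toy_velocity`, `sample_action`, the 7-dimensional action and the step count are all hypothetical, and the closed-form velocity stands in for the learned action expert, which in the real model would be conditioned on the VLM's cached keys and values.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned action expert.

    For a straight-line path from noise to one known target, the exact
    velocity field is (target - x) / (1 - t). A real model would predict
    the velocity from images, the instruction and the VLM's KV cache.
    """
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 7-dimensional action, e.g. six joint deltas plus a gripper command.
target_action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]
action = sample_action(target_action)
```

Because the toy field is exact for a single target, the Euler integration lands on `target_action` up to floating-point error. A learned field is only approximate, so the number of integration steps becomes a trade-off between action quality and latency.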
Why does it matter?
Frontier robotics has been gated by proprietary systems (π₀.₅, Gemini Robotics ER) and small private datasets. Ai2 ships competitive numbers: Molmo 2-ER beats GPT-5 and Gemini Robotics ER-1.5 on average across 13 embodied benchmarks. It also ships the data and training code, which is what lets smaller labs and university groups reproduce the result. Stanford's Cong Lab is already piloting the model for CRISPR wet-lab work.
Who is it for?
Robotics researchers, embodied-AI labs, and anyone building VLAs on top of open weights.
Try it
https://huggingface.co/collections/allenai/molmoact2-models