Alaya Studio / University of Tokyo / Shanghai Innovation Institute · 2026-04-23 · notable
WorldMark — First Unified Benchmark for Interactive Video World Models
WorldMark standardizes evaluation for interactive I2V world models, including Genie 3, YUME 1.5, HY-World 1.5, and Matrix-Game 2.0, with a shared WASD+L/R action vocabulary, 500 test cases, and 8 metrics. Key finding: visual quality and world consistency are largely uncorrelated. YUME 1.5 produces the best-looking frames but loses coherence quickly; Genie 3 is the most consistent but has only moderate visual fidelity.

The first benchmark that compares Genie 3, YUME 1.5, HY-World 1.5, and Matrix-Game 2.0 head-to-head on identical scenes and action sequences.
Key specs
| Spec | Value |
|---|---|
| Test cases | 500 |
| Models evaluated | 6 |
| Hugging Face upvotes | 31 |
What is it?
WorldMark is a benchmark suite for interactive Image-to-Video (I2V) world models — systems that generate continuous video in response to keyboard movement commands. Before WorldMark, each model used its own private benchmark with proprietary scenes and action formats, making cross-model comparison impossible. WorldMark provides a shared WASD+L/R action vocabulary, adapter scripts that translate into each model's native format, 500 standardized test cases, and an 8-metric evaluation toolkit.
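A minimal sketch of the adapter idea may help make it concrete. It is not WorldMark's actual code: the function names, the two "native" formats (one-hot vectors and text prompts), and the reading of L/R as camera turns are all assumptions made for illustration.

```python
from typing import Dict, List

# Shared vocabulary named in the text. Treating L/R as camera turns is an assumption.
SHARED_ACTIONS = ("W", "A", "S", "D", "L", "R")


def parse_trajectory(spec: str) -> List[str]:
    """Turn a compact string such as 'WWDL' into a validated list of shared actions."""
    actions = list(spec.upper())
    unknown = sorted(set(actions) - set(SHARED_ACTIONS))
    if unknown:
        raise ValueError(f"actions outside the shared vocabulary: {unknown}")
    return actions


def one_hot_adapter(actions: List[str]) -> List[List[int]]:
    """Hypothetical native format A: one one-hot vector per step."""
    index = {name: i for i, name in enumerate(SHARED_ACTIONS)}
    width = len(SHARED_ACTIONS)
    return [[1 if index[a] == i else 0 for i in range(width)] for a in actions]


# Hypothetical native format B: a short text instruction per step.
PROMPTS: Dict[str, str] = {
    "W": "move forward",
    "S": "move backward",
    "A": "move left",
    "D": "move right",
    "L": "turn camera left",
    "R": "turn camera right",
}


def text_adapter(actions: List[str]) -> List[str]:
    """Map each shared action to its text instruction."""
    return [PROMPTS[a] for a in actions]
```

With this layout, one trajectory string such as `"WWDL"` is parsed once and then handed to whichever adapter a given model needs, so every model receives the same sequence of intended actions.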
How does it work?
The benchmark standardizes three inputs: 50 reference scenes (first- and third-person, realistic and stylized, yielding 100 test images), 15 action trajectories of increasing complexity (from Easy at 20 s to Hard at 60 s), and per-model adapter scripts. The metrics fall into three groups: Visual Quality (Aesthetic Quality, Imaging Quality), Control Alignment (Translation Error and Rotation Error, computed from DROID-SLAM reconstruction), and World Consistency (Reprojection Error plus three VLM-scored dimensions: State, Content, and Style Consistency).
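The summary does not give the exact formulas behind the Control Alignment metrics. As a sketch only, the block below uses a common formulation for comparing an estimated camera pose with an intended one: Euclidean distance for translation and the geodesic angle between rotation matrices for rotation. The function names and the `yaw_matrix` helper are hypothetical, and the paper's definitions may differ.

```python
import math
from typing import List, Sequence

Matrix3 = Sequence[Sequence[float]]
Vector3 = Sequence[float]


def translation_error(t_est: Vector3, t_ref: Vector3) -> float:
    """Euclidean distance between estimated and reference camera positions."""
    return math.dist(t_est, t_ref)


def rotation_error_deg(r_est: Matrix3, r_ref: Matrix3) -> float:
    """Geodesic angle, in degrees, between two 3x3 rotation matrices."""
    # trace(r_est^T @ r_ref) equals the sum of elementwise products.
    trace = sum(r_est[i][j] * r_ref[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def yaw_matrix(degrees: float) -> List[List[float]]:
    """Hypothetical helper: rotation about the vertical (y) axis."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
```

For example, if the action sequence asked for a 90-degree turn and the reconstructed camera turned only 60 degrees, `rotation_error_deg(yaw_matrix(60), yaw_matrix(90))` would report an error of about 30 degrees.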
Why does it matter?
The paper's cross-model evaluation reveals findings that are invisible without a common framework: visual quality and world consistency are largely uncorrelated; YUME 1.5 has the best frames but loses coherence quickly; Genie 3 maintains the most consistent world state with only moderate visual fidelity; Matrix-Game 2.0's rotation error increases 20x when switching from first- to third-person view; and Open-Oasis, trained on Minecraft, fails entirely on real-world scenes. These tradeoffs are actionable for teams building or choosing a world model.
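One way a reader could check a claim such as "largely uncorrelated" on their own per-model scores is a rank correlation. The sketch below implements Spearman's coefficient with the standard library; it is an illustration rather than the paper's analysis, and the function names are hypothetical. It contains no benchmark data.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Passing one visual-quality score and one consistency score per model would yield a value near 0 if the two rankings are unrelated, near 1 if they agree, and near -1 if they are reversed. With only a handful of models the estimate is noisy, so it is best read as a rough indicator.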
Who is it for?
Researchers building or evaluating interactive video generation and world model systems
Try it
https://arxiv.org/abs/2604.21686