Alaya Studio / University of Tokyo / Shanghai Innovation Institute · 2026-04-23 · notable

WorldMark — First Unified Benchmark for Interactive Video World Models

WorldMark standardizes evaluation for interactive I2V world models (Genie 3, YUME 1.5, HY-World 1.5, Matrix-Game 2.0) with a shared WASD action vocabulary, 500 test cases, and 8 metrics. Key finding: visual quality and world consistency are largely uncorrelated. YUME 1.5 produces the best frames but has poor coherence, while Genie 3 is the most consistent but offers only moderate fidelity.

WorldMark benchmark overview — unified evaluation framework comparing YUME, HY-World, Matrix-Game, Genie 3 on identical test scenes

The first benchmark to compare Genie 3, YUME 1.5, HY-World 1.5, and Matrix-Game 2.0 head-to-head on identical scenes and action sequences.

Key specs

Test cases	500
Models evaluated	6
Hugging Face upvotes	31

What is it?

WorldMark is a benchmark suite for interactive Image-to-Video (I2V) world models — systems that generate continuous video in response to keyboard movement commands. Before WorldMark, each model used its own private benchmark with proprietary scenes and action formats, making cross-model comparison impossible. WorldMark provides a shared WASD+L/R action vocabulary, adapter scripts that translate into each model's native format, 500 standardized test cases, and an 8-metric evaluation toolkit.
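The adapter idea can be illustrated with a minimal Python sketch. It assumes the shared vocabulary is the six keys W, A, S, D, L, R; the two "native formats" and the names model_a, model_b, parse_trajectory, and adapt are hypothetical and are not taken from the WorldMark code or from any evaluated model.

```python
from typing import Callable, Dict, List, Tuple

# Shared vocabulary from the text: W/A/S/D for movement, L/R for turning.
SHARED_ACTIONS = ("W", "A", "S", "D", "L", "R")


def parse_trajectory(spec: str) -> List[str]:
    """Split a string such as 'WWAL' into validated per-step actions."""
    actions = list(spec.upper())
    unknown = [a for a in actions if a not in SHARED_ACTIONS]
    if unknown:
        raise ValueError(f"unknown actions: {unknown}")
    return actions


def to_key_events(actions: List[str]) -> List[Dict[str, str]]:
    """Hypothetical native format 1: one lowercase key event per step."""
    return [{"key": a.lower()} for a in actions]


def to_motion_vectors(actions: List[str]) -> List[Tuple[int, int, int]]:
    """Hypothetical native format 2: (strafe, forward, turn) per step."""
    table = {
        "W": (0, 1, 0), "S": (0, -1, 0),
        "A": (-1, 0, 0), "D": (1, 0, 0),
        "L": (0, 0, -1), "R": (0, 0, 1),
    }
    return [table[a] for a in actions]


# One adapter per model; the model names are placeholders.
ADAPTERS: Dict[str, Callable[[List[str]], list]] = {
    "model_a": to_key_events,
    "model_b": to_motion_vectors,
}


def adapt(model: str, spec: str) -> list:
    """Translate a shared-vocabulary trajectory into a model's native format."""
    return ADAPTERS[model](parse_trajectory(spec))
```

With this layout, the same trajectory string (for example "WWL") can be handed to every model, and only the adapter differs, which is what makes the resulting videos comparable.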

How does it work?

The benchmark standardizes three inputs: 50 reference scenes (first- and third-person, realistic and stylized, yielding 100 test images), 15 action trajectories of increasing complexity (from Easy at 20 s to Hard at 60 s), and per-model adapter scripts. Metrics cover Visual Quality (Aesthetic Quality, Imaging Quality), Control Alignment (Translation Error and Rotation Error, computed from DROID-SLAM reconstruction), and World Consistency (Reprojection Error plus three VLM-scored dimensions: State, Content, and Style Consistency).
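As a rough illustration of the Control Alignment metrics, the sketch below compares an estimated camera pose with a reference pose. It uses the standard geodesic angle between rotation matrices and the Euclidean distance between positions; these are common definitions assumed here for illustration, not necessarily the exact formulas in the paper. Pose estimation itself (DROID-SLAM) and any scale or trajectory alignment are omitted, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix3 = List[List[float]]


def rotation_error_deg(R_est: Matrix3, R_ref: Matrix3) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_ref^T @ R_est) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_ref[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est: Sequence[float], t_ref: Sequence[float]) -> float:
    """Euclidean distance between estimated and reference camera positions."""
    return math.dist(t_est, t_ref)


def yaw_matrix(deg: float) -> Matrix3:
    """Rotation about the vertical (z) axis, used here to build test poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Example: the model turned 50 degrees where the action sequence implied 20.
err = rotation_error_deg(yaw_matrix(50.0), yaw_matrix(20.0))
```

In a full pipeline, per-frame errors like these would be aggregated over each trajectory; how WorldMark aggregates them is not stated in this summary.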

Why does it matter?

The paper's cross-model evaluation reveals findings that are invisible without a common framework: visual quality and world consistency are largely uncorrelated; YUME 1.5 has the best frames but loses coherence quickly; Genie 3 maintains the most consistent world state with only moderate visual fidelity; Matrix-Game 2.0's rotation error worsens 20x when switching from first- to third-person view; and Open-Oasis, trained on Minecraft, fails entirely on real-world scenes. These tradeoffs are actionable for teams building or choosing a world model.

Who is it for?

Researchers building or evaluating interactive video generation and world model systems

Try it

https://arxiv.org/abs/2604.21686