Xiamen University, Shanghai AI Lab, Tencent · 2026-04-06 · notable
Video-MME-v2
Second-gen comprehensive video-understanding benchmark: three-tier complexity and group-based grading that penalises inconsistent answers, built with over 3,300 human-hours of annotation.
The Video-MME benchmark gets a harder, more honest successor — with grading that penalises models for inconsistent answers.
Key specs
| Spec | Value |
|---|---|
| Annotation hours | 3,300 |
| Annotators | 12 |
| Reviewers | 50 |
What is it?
Video-MME-v2 is a new evaluation benchmark for video-understanding models from a coalition including researchers at Xiamen University, Shanghai AI Lab and Tencent. It is the sequel to the original CVPR 2025 Video-MME, which frontier models have largely saturated. Version 2 is designed to expose where video-LLMs actually fail.
How does it work?
The benchmark has a three-tier complexity structure: multi-point visual information aggregation, temporal dynamics modelling, and multimodal reasoning that combines the two. Rather than plain accuracy, it uses group-based assessment that rewards coherent reasoning and penalises inconsistent answers across related questions. Data was curated by 12 annotators and 50 reviewers over roughly 3,300 human-hours with up to five review cycles.
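To see why group-based grading is stricter than plain accuracy, here is a minimal sketch. It assumes an all-or-nothing rule in which a group of related questions earns credit only if every answer in it is correct. That rule is an illustrative assumption, not the benchmark's published scoring formula, and the function names and sample results are hypothetical.

```python
from collections import defaultdict


def plain_accuracy(results):
    """Fraction of individual questions answered correctly."""
    return sum(correct for _, correct in results) / len(results)


def group_score(results):
    """Fraction of groups in which every question was answered correctly.

    Assumed all-or-nothing rule, for illustration only.
    """
    groups = defaultdict(list)
    for group_id, correct in results:
        groups[group_id].append(correct)
    return sum(all(answers) for answers in groups.values()) / len(groups)


# Hypothetical results: (group_id, answered_correctly)
results = [
    ("g1", True), ("g1", True), ("g1", True),   # consistent: full credit
    ("g2", True), ("g2", False), ("g2", True),  # one miss: no group credit
    ("g3", True), ("g3", True), ("g3", False),  # one miss: no group credit
]

print(f"plain accuracy: {plain_accuracy(results):.2f}")  # 7 of 9 questions
print(f"group score:    {group_score(results):.2f}")     # 1 of 3 groups
```

Under this assumed rule, a model that gets most individual questions right still scores low if its answers to related questions contradict each other, which is the behaviour the group-based assessment is meant to surface.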
Why does it matter?
If you are building or buying a video-language model, the old Video-MME is no longer a useful ranking signal, because frontier models have largely saturated it. Video-MME-v2 restores the signal and, more importantly, isolates which tier (visual information aggregation, temporal dynamics modelling, multimodal reasoning) is dragging a given model down.
Who is it for?
VLM researchers and teams shipping video-AI products.