Xiamen University, Shanghai AI Lab, Tencent · 2026-04-06 · notable

Video-MME-v2

Name: Video-MME-v2
Creator: Xiamen University, Shanghai AI Lab, Tencent
Published: 2026-04-06
License: https://creativecommons.org/licenses/by/4.0/
Keywords: benchmark, video, multimodal, vlm

Second-gen comprehensive video-understanding benchmark: three-tier complexity and group-based grading that penalises inconsistent answers, built with over 3,300 human-hours of annotation.

The Video-MME benchmark gets a harder, more honest successor, with grading that penalises models for inconsistent answers.

Key specs

Annotation hours	3,300
Annotators	12
Reviewers	50

What is it?

Video-MME-v2 is a new evaluation benchmark for video-understanding models from a coalition including researchers at Xiamen University, Shanghai AI Lab and Tencent. It is the sequel to the original CVPR 2025 Video-MME, which frontier models have largely saturated. Version 2 is designed to expose where video-LLMs actually fail.

How does it work?

The benchmark has a three-tier complexity structure: multi-point visual information aggregation, temporal dynamics modelling, and multimodal reasoning that combines the two. Rather than plain accuracy, it uses group-based assessment that rewards coherent reasoning and penalises inconsistent answers across related questions. Data was curated by 12 annotators and 50 reviewers over roughly 3,300 human-hours with up to five review cycles.
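To make the difference between plain accuracy and group-based grading concrete, here is a minimal sketch in Python. The exact scoring rule is not described above, so this assumes an all-or-nothing rule: a group of related questions earns credit only when every question in it is answered correctly. The names `Result`, `group_id`, `plain_accuracy` and `strict_group_score` are hypothetical and are not taken from the benchmark's code.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Result(NamedTuple):
    group_id: str  # hypothetical ID linking related questions about one video
    correct: bool  # whether the model answered this question correctly


def plain_accuracy(results: Iterable[Result]) -> float:
    """Fraction of individual questions answered correctly."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    return sum(r.correct for r in results) / len(results)


def strict_group_score(results: Iterable[Result]) -> float:
    """Fraction of groups in which every question was answered correctly.

    Assumed all-or-nothing rule; the benchmark's real metric may differ.
    """
    groups = defaultdict(list)
    for r in results:
        groups[r.group_id].append(r.correct)
    if not groups:
        raise ValueError("no results to score")
    return sum(all(answers) for answers in groups.values()) / len(groups)


# Two groups of two related questions each; the model slips once in group g2.
results = [
    Result("g1", True), Result("g1", True),
    Result("g2", True), Result("g2", False),
]

print(plain_accuracy(results))      # 3 of 4 questions correct
print(strict_group_score(results))  # only 1 of 2 groups fully consistent
```

Under this assumed rule, a single inconsistent answer costs the whole group, so a model that guesses well on isolated questions scores lower than its per-question accuracy suggests.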

Why does it matter?

If you are building or buying a video-language model, the old Video-MME is no longer a useful ranking signal, because frontier models all score near the top. Video-MME-v2 restores the signal and, more importantly, helps isolate which tier (visual information aggregation, temporal modelling, or multimodal reasoning) is dragging a given model down.

Who is it for?

VLM researchers and teams shipping video-AI products.