AI/TLDR

Qwen2-VL

Alibaba's open vision-language family (2B/7B/72B) with dynamic-resolution vision, M-RoPE, and 20-minute video understanding.

Overview

Qwen2-VL is the second-generation vision-language model line from Alibaba's Qwen team, announced on 29 August 2024 with the 2B and 7B sizes, followed by the 72B instruction-tuned model on 19 September 2024 and the technical report (arXiv 2409.12191). It reads images, multi-page documents, charts, and video, and answers questions about them. The 2B and 7B checkpoints ship under Apache 2.0, while the flagship Qwen2-VL-72B-Instruct is released under Alibaba's own Qwen (tongyi-qianwen) license.

The release's headline ideas are Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-RoPE). Rather than squashing every image to a fixed size, Qwen2-VL turns an image into a variable number of visual tokens based on its actual resolution, which helps fine-grained tasks like OCR and document parsing. M-RoPE gives the model a single positional scheme that spans text, 2D images, and 3D video, letting Qwen2-VL handle videos longer than 20 minutes for video question-answering.

On standard multimodal benchmarks, Qwen2-VL-72B scores competitively with closed models of its era on document and OCR-heavy tasks: 96.5 on DocVQA, 64.5 on MMMU, 70.5 on MathVista, and 877 on OCRBench. The smaller 2B and 7B models trade accuracy for footprint, making the 2B a practical choice for on-device and edge deployment. The line was later superseded by Qwen2.5-VL (January 2025) and Qwen3-VL, but the Qwen2-VL weights remain freely downloadable from Hugging Face.

Released2024-08
LicenseApache 2.0 (Qwen2-VL-2B and 7B); Qwen license / tongyi-qianwen (Qwen2-VL-72B)
WeightsOpen weights
Parameters2B, 7B, and 72B (dense) variants
Context32,768 tokens (max_position_embeddings)
ArchitectureDense decoder LLM paired with a ~600M-parameter Vision Transformer encoder. Two signature changes over the original Qwen-VL: Naive Dynamic Resolution, which maps an image of any resolution to a variable number of visual tokens instead of a fixed grid, and Multimodal Rotary Position Embedding (M-RoPE), which decomposes positional encoding into 1D textual, 2D spatial, and 3D temporal components so text, images, and video share one positional scheme.
Knowledge cutoffJune 2023
ModalitiesText, Image, Video
StatusSuperseded by Qwen2.5-VL (Jan 2025) and Qwen3-VL (2025). Weights remain openly available on Hugging Face; not formally deprecated.

Benchmarks

  1. DocVQA (test)96.5%
  2. MMMU (val)64.5%
  3. MathVista (testmini)70.5%
  4. RealWorldQA77.8%
  5. MTVQA30.9%
  6. TextVQA (val)85.5%
  7. MMBench-EN86.5%
  8. EgoSchema (video, test)77.9%
  9. MVBench (video)73.6%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

  • Strong document and OCR understanding — Qwen2-VL-72B reaches 96.5 on DocVQA and 877 on OCRBench
  • Naive Dynamic Resolution handles arbitrary image sizes without forced downscaling, helping fine-grained text-in-image tasks
  • Long video comprehension (20+ minutes) via M-RoPE's 3D temporal positional encoding
  • Three sizes (2B/7B/72B) span from edge/on-device to server-grade deployment
  • Open weights — 2B and 7B under permissive Apache 2.0, allowing commercial use and fine-tuning
  • Multilingual OCR and multilingual visual QA (MTVQA) coverage

Best for

  • Document, form, and invoice parsing where text-in-image accuracy matters
  • Multilingual OCR and screenshot/UI understanding
  • Chart and diagram question-answering for data extraction
  • Video understanding and summarization of clips longer than 20 minutes
  • On-device or edge multimodal assistants using the compact 2B checkpoint
  • Visual reasoning for mobile, automotive, and robotics agent prototypes

How to access

ProviderModel ID
Hugging Face (Qwen2-VL-72B-Instruct) ↗Qwen/Qwen2-VL-72B-Instruct
Hugging Face (Qwen2-VL-7B-Instruct) ↗Qwen/Qwen2-VL-7B-Instruct
Hugging Face (Qwen2-VL-2B-Instruct) ↗Qwen/Qwen2-VL-2B-Instruct

Qwen-VL — every version

The full lineage of the Qwen-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
Qwen3-VLcurrent2025-09-23Apache-2.0
Qwen2.5-VL2025-01Open weights
Qwen2-VL2024-12Open weights

FAQ

When was Qwen2-VL released?

Alibaba's Qwen team announced Qwen2-VL on 29 August 2024 with the 2B and 7B sizes. The instruction-tuned 72B model followed on 19 September 2024, and the technical report (arXiv 2409.12191) was published in September 2024.

Is Qwen2-VL open source, and what license does it use?

Yes. Qwen2-VL-2B and Qwen2-VL-7B are released under the permissive Apache 2.0 license, which allows commercial use and fine-tuning. The flagship Qwen2-VL-72B-Instruct uses Alibaba's own Qwen (tongyi-qianwen) license rather than Apache 2.0.

What sizes and context length does Qwen2-VL offer?

Qwen2-VL comes in 2B, 7B, and 72B dense variants. The released checkpoints have a maximum context of 32,768 tokens (max_position_embeddings), and the model can process video longer than 20 minutes thanks to its M-RoPE positional encoding.

How does Qwen2-VL compare on benchmarks?

The 72B model scores 96.5 on DocVQA, 64.5 on MMMU, 70.5 on MathVista, and 877 on OCRBench, making it especially strong on document and OCR-heavy tasks. It was later superseded by Qwen2.5-VL (January 2025) and Qwen3-VL, which improve document parsing, grounding, and long-video understanding.