Qwen2-VL

Alibaba's open vision-language family (2B/7B/72B) with dynamic-resolution vision, M-RoPE, and 20-minute video understanding.

Overview

Qwen2-VL is the second-generation vision-language model line from Alibaba's Qwen team, announced on 29 August 2024 with the 2B and 7B sizes, followed by the 72B instruction-tuned model on 19 September 2024 and the technical report (arXiv 2409.12191). It reads images, multi-page documents, charts, and video, and answers questions about them. The 2B and 7B checkpoints ship under Apache 2.0, while the flagship Qwen2-VL-72B-Instruct is released under Alibaba's own Qwen (tongyi-qianwen) license.

The release's headline ideas are Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-RoPE). Rather than squashing every image to a fixed size, Qwen2-VL turns an image into a variable number of visual tokens based on its actual resolution, which helps fine-grained tasks like OCR and document parsing. M-RoPE gives the model a single positional scheme that spans text, 2D images, and 3D video, letting Qwen2-VL handle videos longer than 20 minutes for video question-answering.

On standard multimodal benchmarks, Qwen2-VL-72B scores competitively with closed models of its era on document and OCR-heavy tasks: 96.5 on DocVQA, 64.5 on MMMU, 70.5 on MathVista, and 877 on OCRBench. The smaller 2B and 7B models trade accuracy for footprint, making the 2B a practical choice for on-device and edge deployment. The line was later superseded by Qwen2.5-VL (January 2025) and Qwen3-VL, but the Qwen2-VL weights remain freely downloadable from Hugging Face.

Released	2024-08
License	Apache 2.0 (Qwen2-VL-2B and 7B); Qwen license / tongyi-qianwen (Qwen2-VL-72B)
Weights	Open weights
Parameters	2B, 7B, and 72B (dense) variants
Context	32,768 tokens (max_position_embeddings)
Architecture	Dense decoder LLM paired with a ~600M-parameter Vision Transformer encoder. Two signature changes over the original Qwen-VL: Naive Dynamic Resolution, which maps an image of any resolution to a variable number of visual tokens instead of a fixed grid, and Multimodal Rotary Position Embedding (M-RoPE), which decomposes positional encoding into 1D textual, 2D spatial, and 3D temporal components so text, images, and video share one positional scheme.
Knowledge cutoff	June 2023
Modalities	Text, Image, Video
Status	Superseded by Qwen2.5-VL (Jan 2025) and Qwen3-VL (2025). Weights remain openly available on Hugging Face; not formally deprecated.

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

Strong document and OCR understanding — Qwen2-VL-72B reaches 96.5 on DocVQA and 877 on OCRBench
Naive Dynamic Resolution handles arbitrary image sizes without forced downscaling, helping fine-grained text-in-image tasks
Long video comprehension (20+ minutes) via M-RoPE's 3D temporal positional encoding
Three sizes (2B/7B/72B) span from edge/on-device to server-grade deployment
Open weights — 2B and 7B under permissive Apache 2.0, allowing commercial use and fine-tuning
Multilingual OCR and multilingual visual QA (MTVQA) coverage

Best for

Document, form, and invoice parsing where text-in-image accuracy matters
Multilingual OCR and screenshot/UI understanding
Chart and diagram question-answering for data extraction
Video understanding and summarization of clips longer than 20 minutes
On-device or edge multimodal assistants using the compact 2B checkpoint
Visual reasoning for mobile, automotive, and robotics agent prototypes

How to access

Provider	Model ID
Hugging Face (Qwen2-VL-72B-Instruct) ↗	`Qwen/Qwen2-VL-72B-Instruct`
Hugging Face (Qwen2-VL-7B-Instruct) ↗	`Qwen/Qwen2-VL-7B-Instruct`
Hugging Face (Qwen2-VL-2B-Instruct) ↗	`Qwen/Qwen2-VL-2B-Instruct`

Qwen-VL — every version

The full lineage of the Qwen-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
Qwen3-VLcurrent	2025-09-23	—	Apache-2.0
Qwen2.5-VL	2025-01	—	Open weights
Qwen2-VL	2024-12	—	Open weights

FAQ

When was Qwen2-VL released?

Alibaba's Qwen team announced Qwen2-VL on 29 August 2024 with the 2B and 7B sizes. The instruction-tuned 72B model followed on 19 September 2024, and the technical report (arXiv 2409.12191) was published in September 2024.

Is Qwen2-VL open source, and what license does it use?

Yes. Qwen2-VL-2B and Qwen2-VL-7B are released under the permissive Apache 2.0 license, which allows commercial use and fine-tuning. The flagship Qwen2-VL-72B-Instruct uses Alibaba's own Qwen (tongyi-qianwen) license rather than Apache 2.0.

What sizes and context length does Qwen2-VL offer?

Qwen2-VL comes in 2B, 7B, and 72B dense variants. The released checkpoints have a maximum context of 32,768 tokens (max_position_embeddings), and the model can process video longer than 20 minutes thanks to its M-RoPE positional encoding.

How does Qwen2-VL compare on benchmarks?

The 72B model scores 96.5 on DocVQA, 64.5 on MMMU, 70.5 on MathVista, and 877 on OCRBench, making it especially strong on document and OCR-heavy tasks. It was later superseded by Qwen2.5-VL (January 2025) and Qwen3-VL, which improve document parsing, grounding, and long-video understanding.

// Overview

// Benchmarks

// Strengths

// Best for

// How to access

// Qwen-VL — every version

// FAQ