Overview
Qwen2-VL is the second-generation vision-language model line from Alibaba's Qwen team, announced on 29 August 2024 with the 2B and 7B sizes, followed by the 72B instruction-tuned model on 19 September 2024 and the technical report (arXiv 2409.12191). It reads images, multi-page documents, charts, and video, and answers questions about them. The 2B and 7B checkpoints ship under Apache 2.0, while the flagship Qwen2-VL-72B-Instruct is released under Alibaba's own Qwen (tongyi-qianwen) license.
The release's headline ideas are Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-RoPE). Rather than squashing every image to a fixed size, Qwen2-VL turns an image into a variable number of visual tokens based on its actual resolution, which helps fine-grained tasks like OCR and document parsing. M-RoPE gives the model a single positional scheme that spans text, 2D images, and 3D video, letting Qwen2-VL handle videos longer than 20 minutes for video question-answering.
On standard multimodal benchmarks, Qwen2-VL-72B scores competitively with closed models of its era on document and OCR-heavy tasks: 96.5 on DocVQA, 64.5 on MMMU, 70.5 on MathVista, and 877 on OCRBench. The smaller 2B and 7B models trade accuracy for footprint, making the 2B a practical choice for on-device and edge deployment. The line was later superseded by Qwen2.5-VL (January 2025) and Qwen3-VL, but the Qwen2-VL weights remain freely downloadable from Hugging Face.
| Released | 2024-08 |
|---|---|
| License | Apache 2.0 (Qwen2-VL-2B and 7B); Qwen license / tongyi-qianwen (Qwen2-VL-72B) |
| Weights | Open weights |
| Parameters | 2B, 7B, and 72B (dense) variants |
| Context | 32,768 tokens (max_position_embeddings) |
| Architecture | Dense decoder LLM paired with a ~600M-parameter Vision Transformer encoder. Two signature changes over the original Qwen-VL: Naive Dynamic Resolution, which maps an image of any resolution to a variable number of visual tokens instead of a fixed grid, and Multimodal Rotary Position Embedding (M-RoPE), which decomposes positional encoding into 1D textual, 2D spatial, and 3D temporal components so text, images, and video share one positional scheme. |
| Knowledge cutoff | June 2023 |
| Modalities | Text, Image, Video |
| Status | Superseded by Qwen2.5-VL (Jan 2025) and Qwen3-VL (2025). Weights remain openly available on Hugging Face; not formally deprecated. |
Benchmarks
- DocVQA (test)96.5%
- MMMU (val)64.5%
- MathVista (testmini)70.5%
- RealWorldQA77.8%
- MTVQA30.9%
- TextVQA (val)85.5%
- MMBench-EN86.5%
- EgoSchema (video, test)77.9%
- MVBench (video)73.6%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Strengths
- Strong document and OCR understanding — Qwen2-VL-72B reaches 96.5 on DocVQA and 877 on OCRBench
- Naive Dynamic Resolution handles arbitrary image sizes without forced downscaling, helping fine-grained text-in-image tasks
- Long video comprehension (20+ minutes) via M-RoPE's 3D temporal positional encoding
- Three sizes (2B/7B/72B) span from edge/on-device to server-grade deployment
- Open weights — 2B and 7B under permissive Apache 2.0, allowing commercial use and fine-tuning
- Multilingual OCR and multilingual visual QA (MTVQA) coverage
Best for
- Document, form, and invoice parsing where text-in-image accuracy matters
- Multilingual OCR and screenshot/UI understanding
- Chart and diagram question-answering for data extraction
- Video understanding and summarization of clips longer than 20 minutes
- On-device or edge multimodal assistants using the compact 2B checkpoint
- Visual reasoning for mobile, automotive, and robotics agent prototypes
How to access
| Provider | Model ID |
|---|---|
| Hugging Face (Qwen2-VL-72B-Instruct) ↗ | Qwen/Qwen2-VL-72B-Instruct |
| Hugging Face (Qwen2-VL-7B-Instruct) ↗ | Qwen/Qwen2-VL-7B-Instruct |
| Hugging Face (Qwen2-VL-2B-Instruct) ↗ | Qwen/Qwen2-VL-2B-Instruct |
Qwen-VL — every version
The full lineage of the Qwen-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Qwen3-VLcurrent | 2025-09-23 | — | Apache-2.0 |
| Qwen2.5-VL | 2025-01 | — | Open weights |
| Qwen2-VL | 2024-12 | — | Open weights |
FAQ
When was Qwen2-VL released?
Alibaba's Qwen team announced Qwen2-VL on 29 August 2024 with the 2B and 7B sizes. The instruction-tuned 72B model followed on 19 September 2024, and the technical report (arXiv 2409.12191) was published in September 2024.
Is Qwen2-VL open source, and what license does it use?
Yes. Qwen2-VL-2B and Qwen2-VL-7B are released under the permissive Apache 2.0 license, which allows commercial use and fine-tuning. The flagship Qwen2-VL-72B-Instruct uses Alibaba's own Qwen (tongyi-qianwen) license rather than Apache 2.0.
What sizes and context length does Qwen2-VL offer?
Qwen2-VL comes in 2B, 7B, and 72B dense variants. The released checkpoints have a maximum context of 32,768 tokens (max_position_embeddings), and the model can process video longer than 20 minutes thanks to its M-RoPE positional encoding.
How does Qwen2-VL compare on benchmarks?
The 72B model scores 96.5 on DocVQA, 64.5 on MMMU, 70.5 on MathVista, and 877 on OCRBench, making it especially strong on document and OCR-heavy tasks. It was later superseded by Qwen2.5-VL (January 2025) and Qwen3-VL, which improve document parsing, grounding, and long-video understanding.