Qwen2.5-VL

Alibaba's open-weight vision-language model line for documents, video and on-screen agents

Overview

Qwen2.5-VL is the vision-language model series from Alibaba's Qwen team, released open-weight in late January 2025 at 3B, 7B and 72B sizes, with a 32B variant added in March 2025. It is the successor to Qwen2-VL and the multimodal counterpart to the Qwen2.5 LLM line, processing text, images and video in a single model.

Beyond standard image captioning and VQA, Qwen2.5-VL targets practical document and agent work. It parses charts, tables, forms and full page layouts (outputting structured QwenVL-HTML), grounds objects with bounding boxes and points in JSON, reads long videos with second-level event localization, and can act as a visual agent that drives computer and phone UIs without task-specific fine-tuning.

The flagship Qwen2.5-VL-72B-Instruct scores 70.2 on MMMU and 96.4 on DocVQA, competitive with closed models like GPT-4o on several document and chart tasks. Licensing varies by size: the 7B and 32B weights are Apache 2.0, the 72B uses the Qwen License, and the 3B is restricted to non-commercial use under the Qwen Research License.

Released	2025-01-26
License	Mixed by size: Qwen2.5-VL-72B-Instruct under the Qwen License; 7B and 32B under Apache 2.0; 3B under the Qwen Research License (non-commercial only).
Weights	Open weights
Parameters	Four open sizes: 3B, 7B, 32B and 72B (each in base and instruction-tuned variants)
Context	32,768 tokens (extendable to ~64K for long video via YaRN; some API providers expose up to 131K)
Max output	Not separately documented; bounded by the 32,768-token context window
Architecture	Decoder-only LLM paired with a from-scratch native dynamic-resolution Vision Transformer (ViT) using RMSNorm and SwiGLU. The ViT mixes full attention (4 layers) with windowed attention (max 8x8 window) to cut compute on high-resolution inputs. Video is handled with dynamic-FPS sampling and an updated multimodal rotary position embedding (mRoPE) aligned to absolute time, enabling second-level temporal grounding.
Knowledge cutoff	Not officially published by Alibaba
Modalities	text, image, video
Status	Available (open weights). Superseded by Qwen3-VL but still widely served; the 72B flagship is the strongest tier of the Qwen2.5 generation.

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.80 per 1M input tokens per 1M tokens
Output	$1.00 per 1M output tokens per 1M tokens

Pricing shown is for Qwen2.5-VL-72B-Instruct via OpenRouter; the open weights can also be self-hosted for free, and per-token rates vary by provider.

Pricing source ↗

Strengths

State-of-the-art open-weight document and OCR understanding (DocVQA 96.4, ChartQA 89.5) for invoices, forms, charts and dense screenshots
Built-in visual agent capabilities: parses GUIs and emits actions for computer/phone use without extra fine-tuning
Precise visual grounding with bounding-box and point coordinates returned in standardized JSON
Long-video comprehension with second-level temporal localization via dynamic-FPS sampling and time-aligned mRoPE
Four open sizes (3B to 72B) let teams trade accuracy for latency and cost; 7B and 32B are permissively Apache-2.0 licensed
Efficient native dynamic-resolution ViT with windowed attention reduces cost on high-resolution images

Best for

Document AI: extracting text, tables and layout from invoices, contracts, research papers and scanned forms
Chart, diagram and screenshot understanding for analytics and BI workflows
Visual/UI agents that read a screen and operate computer or mobile interfaces
Video analysis with event localization and timestamped question answering on hour-long clips
Object detection and grounding tasks that need JSON bounding boxes or point coordinates
On-prem or self-hosted multimodal deployment using the Apache-2.0 7B/32B weights

How to access

Provider	Model ID
OpenRouter ↗	`qwen/qwen2.5-vl-72b-instruct`
Alibaba Cloud Model Studio ↗	`qwen2.5-vl-72b-instruct`
Hugging Face ↗	`Qwen/Qwen2.5-VL-72B-Instruct`

Qwen-VL — every version

The full lineage of the Qwen-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
Qwen3-VLcurrent	2025-09-23	—	Apache-2.0
Qwen2.5-VL	2025-01	—	Open weights
Qwen2-VL	2024-12	—	Open weights

FAQ

Is Qwen2.5-VL open source?

The weights are openly downloadable, but licensing depends on size. The 7B and 32B models are released under the permissive Apache 2.0 license. The 72B flagship uses the Qwen License, and the 3B model uses the Qwen Research License, which restricts it to non-commercial use.

What sizes does Qwen2.5-VL come in?

Four parameter sizes: 3B, 7B and 72B were released in late January 2025, and a 32B variant followed in March 2025. Each is available in base and instruction-tuned (Instruct) versions.

What can Qwen2.5-VL do beyond describing images?

It parses documents, charts and full page layouts into structured output, grounds objects with JSON bounding boxes and points, understands hour-long videos with second-level event localization, and acts as a visual agent that can operate computer and phone interfaces without task-specific fine-tuning.

How does Qwen2.5-VL-72B compare to closed models?

On its model card it reports 70.2 on MMMU and 96.4 on DocVQA, making it competitive with proprietary models like GPT-4o on document, chart and OCR-heavy tasks, while remaining one of the strongest open-weight vision-language models of its generation.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// Qwen-VL — every version

// FAQ