AI/TLDR

Qwen2.5-VL

Alibaba's open-weight vision-language model line for documents, video and on-screen agents

Overview

Qwen2.5-VL is the vision-language model series from Alibaba's Qwen team, released open-weight in late January 2025 at 3B, 7B and 72B sizes, with a 32B variant added in March 2025. It is the successor to Qwen2-VL and the multimodal counterpart to the Qwen2.5 LLM line, processing text, images and video in a single model.

Beyond standard image captioning and VQA, Qwen2.5-VL targets practical document and agent work. It parses charts, tables, forms and full page layouts (outputting structured QwenVL-HTML), grounds objects with bounding boxes and points in JSON, reads long videos with second-level event localization, and can act as a visual agent that drives computer and phone UIs without task-specific fine-tuning.

The flagship Qwen2.5-VL-72B-Instruct scores 70.2 on MMMU and 96.4 on DocVQA, competitive with closed models like GPT-4o on several document and chart tasks. Licensing varies by size: the 7B and 32B weights are Apache 2.0, the 72B uses the Qwen License, and the 3B is restricted to non-commercial use under the Qwen Research License.

Released2025-01-26
LicenseMixed by size: Qwen2.5-VL-72B-Instruct under the Qwen License; 7B and 32B under Apache 2.0; 3B under the Qwen Research License (non-commercial only).
WeightsOpen weights
ParametersFour open sizes: 3B, 7B, 32B and 72B (each in base and instruction-tuned variants)
Context32,768 tokens (extendable to ~64K for long video via YaRN; some API providers expose up to 131K)
Max outputNot separately documented; bounded by the 32,768-token context window
ArchitectureDecoder-only LLM paired with a from-scratch native dynamic-resolution Vision Transformer (ViT) using RMSNorm and SwiGLU. The ViT mixes full attention (4 layers) with windowed attention (max 8x8 window) to cut compute on high-resolution inputs. Video is handled with dynamic-FPS sampling and an updated multimodal rotary position embedding (mRoPE) aligned to absolute time, enabling second-level temporal grounding.
Knowledge cutoffNot officially published by Alibaba
Modalitiestext, image, video
StatusAvailable (open weights). Superseded by Qwen3-VL but still widely served; the 72B flagship is the strongest tier of the Qwen2.5 generation.

Benchmarks

  1. MMMU (val)70.2%
  2. MMMU-Pro51.1%
  3. DocVQA (val)96.4%
  4. ChartQA (test)89.5%
  5. MathVista (mini)74.8%
  6. MathVision (full)38.1%
  7. MMBench (dev EN)88%
  8. MMStar70.8%
  9. OCRBench885%
  10. Video-MME (w/o subtitles)73.3%
  11. ScreenSpot Pro43.6%
  12. Android Control (high)93.7%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.80 per 1M input tokens per 1M tokens
Output$1.00 per 1M output tokens per 1M tokens

Pricing shown is for Qwen2.5-VL-72B-Instruct via OpenRouter; the open weights can also be self-hosted for free, and per-token rates vary by provider.

Pricing source ↗

Strengths

  • State-of-the-art open-weight document and OCR understanding (DocVQA 96.4, ChartQA 89.5) for invoices, forms, charts and dense screenshots
  • Built-in visual agent capabilities: parses GUIs and emits actions for computer/phone use without extra fine-tuning
  • Precise visual grounding with bounding-box and point coordinates returned in standardized JSON
  • Long-video comprehension with second-level temporal localization via dynamic-FPS sampling and time-aligned mRoPE
  • Four open sizes (3B to 72B) let teams trade accuracy for latency and cost; 7B and 32B are permissively Apache-2.0 licensed
  • Efficient native dynamic-resolution ViT with windowed attention reduces cost on high-resolution images

Best for

  • Document AI: extracting text, tables and layout from invoices, contracts, research papers and scanned forms
  • Chart, diagram and screenshot understanding for analytics and BI workflows
  • Visual/UI agents that read a screen and operate computer or mobile interfaces
  • Video analysis with event localization and timestamped question answering on hour-long clips
  • Object detection and grounding tasks that need JSON bounding boxes or point coordinates
  • On-prem or self-hosted multimodal deployment using the Apache-2.0 7B/32B weights

How to access

ProviderModel ID
OpenRouter ↗qwen/qwen2.5-vl-72b-instruct
Alibaba Cloud Model Studio ↗qwen2.5-vl-72b-instruct
Hugging Face ↗Qwen/Qwen2.5-VL-72B-Instruct

Qwen-VL — every version

The full lineage of the Qwen-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
Qwen3-VLcurrent2025-09-23Apache-2.0
Qwen2.5-VL2025-01Open weights
Qwen2-VL2024-12Open weights

FAQ

Is Qwen2.5-VL open source?

The weights are openly downloadable, but licensing depends on size. The 7B and 32B models are released under the permissive Apache 2.0 license. The 72B flagship uses the Qwen License, and the 3B model uses the Qwen Research License, which restricts it to non-commercial use.

What sizes does Qwen2.5-VL come in?

Four parameter sizes: 3B, 7B and 72B were released in late January 2025, and a 32B variant followed in March 2025. Each is available in base and instruction-tuned (Instruct) versions.

What can Qwen2.5-VL do beyond describing images?

It parses documents, charts and full page layouts into structured output, grounds objects with JSON bounding boxes and points, understands hour-long videos with second-level event localization, and acts as a visual agent that can operate computer and phone interfaces without task-specific fine-tuning.

How does Qwen2.5-VL-72B compare to closed models?

On its model card it reports 70.2 on MMMU and 96.4 on DocVQA, making it competitive with proprietary models like GPT-4o on document, chart and OCR-heavy tasks, while remaining one of the strongest open-weight vision-language models of its generation.