AI/TLDR

Qwen3-VL

Alibaba's open-weight flagship vision-language model that reads documents, watches hours-long video, and acts as a visual agent.

Overview

Qwen3-VL is the latest generation of Alibaba's Qwen-VL line of vision-language models, with the flagship Qwen3-VL-235B-A22B released on September 23, 2025 under an Apache 2.0 license. It is a Mixture-of-Experts model with 235B total parameters and roughly 22B active per token, shipped in both an Instruct edition and a reasoning-focused Thinking edition. The full family also includes dense 2B, 4B, 8B, and 32B models plus a smaller 30B-A3B MoE, so the same architecture scales from edge devices to cloud servers.

Qwen3-VL handles text, images, video, and documents/PDFs in a single interleaved context of 256K tokens natively, expandable to 1 million tokens via YaRN. That window lets it read entire books and reason over very long videos; Alibaba reports the model can localize events across hours-long footage thanks to its Text-Timestamp Alignment design. OCR has been expanded to 32 languages (up from 19 in the previous generation), with better handling of low light, blur, tilt, rare characters, and long-document structure.

Beyond perception, Qwen3-VL is built to act. It works as a visual agent that recognizes GUI elements on PC and mobile screens, understands their function, invokes tools, and completes multi-step tasks. Alibaba states the Instruct flagship matches or exceeds Gemini 2.5 Pro on major visual-perception benchmarks, while the Thinking edition targets state-of-the-art multimodal reasoning. Open weights are on Hugging Face, and a hosted API (qwen3-vl-plus / qwen3-vl-flash) is available through Alibaba Cloud Model Studio.

Released2025-09-23
LicenseApache 2.0
WeightsOpen weights
Parameters235B total / 22B active (flagship MoE); also dense 2B/4B/8B/32B and MoE 30B-A3B
Context256K
Max output32K
ArchitectureMixture-of-Experts (flagship 235B-A22B) and dense variants, with three vision upgrades over Qwen2.5-VL: Interleaved-MRoPE (full-frequency positional embeddings over time, width, and height for long video), DeepStack (multi-level ViT feature fusion for fine detail), and Text-Timestamp Alignment for precise event localization in video.
Knowledge cutoffNot publicly disclosed
ModalitiesText, Vision, Video, PDF
StatusAvailable

Benchmarks

  1. MathVista85.8%
  2. MathVision74.6%
  3. DocVQA96.5%
  4. OCRBench875%
  5. MMMU-Pro69.3%
  6. ScreenSpot Pro61.8%
  7. AndroidWorld63.7%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.20 / 1M tokens (0-32K tier) per 1M tokens
Output$1.60 / 1M tokens (0-32K tier) per 1M tokens

Hosted qwen3-vl-plus on Alibaba Cloud Model Studio (international); tiered higher for longer context (up to $0.60 input / $4.80 output at 128K-256K). qwen3-vl-flash is cheaper. Open weights can be self-hosted for free under Apache 2.0; third-party hosts such as OpenRouter list the 235B-A22B Instruct model around $0.20 input / $0.88 output per 1M tokens.

Pricing source ↗

Strengths

  • Open weights under a permissive Apache 2.0 license, free to download, fine-tune, and self-host
  • 256K native context expandable to 1M tokens for long documents and hours-long video
  • Strong visual math and reasoning (MathVista 85.8, MathVision 74.6)
  • High-accuracy document understanding and OCR across 32 languages (DocVQA 96.5, OCRBench 875)
  • Visual-agent capabilities: operates PC and mobile GUIs, recognizes elements, and invokes tools
  • Family scales from a 2B edge model to a 235B MoE flagship, with Instruct and Thinking editions

Best for

  • Document, invoice, and form parsing with multilingual OCR
  • Long-video understanding, search, and timestamp-grounded event localization
  • GUI automation and visual agents that operate apps on PC and mobile
  • Multimodal STEM and visual-math problem solving
  • Chart, diagram, and screenshot question answering
  • Self-hosted multimodal applications where open weights and data control matter

How to access

ProviderModel ID
Alibaba Cloud Model Studio ↗qwen3-vl-plus
OpenRouter ↗qwen/qwen3-vl-235b-a22b-instruct

Qwen-VL — every version

The full lineage of the Qwen-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
Qwen3-VLcurrent2025-09-23Apache-2.0
Qwen2.5-VL2025-01Open weights
Qwen2-VL2024-12Open weights

FAQ

Is Qwen3-VL open source?

Yes. The Qwen3-VL weights, including the flagship 235B-A22B model and the dense 2B/4B/8B/32B variants, are released under the Apache 2.0 license, so you can download, fine-tune, and self-host them, including for commercial use.

What is the context window of Qwen3-VL?

It supports a 256K-token context window natively, covering interleaved text, images, and video, and can be extended to 1 million tokens using YaRN. That is large enough to process entire books or hours-long videos in a single pass.

How big is the flagship Qwen3-VL model?

The flagship Qwen3-VL-235B-A22B is a Mixture-of-Experts model with 235B total parameters and about 22B active per token. It comes in an Instruct edition and a reasoning-focused Thinking edition; the full release weighs around 471 GB.

Can Qwen3-VL act as an agent or do OCR?

Yes. Qwen3-VL works as a visual agent that recognizes GUI elements on PC and mobile screens and invokes tools to complete tasks, and its OCR now covers 32 languages with strong document-parsing results (DocVQA 96.5, OCRBench 875).