Overview
Qwen3-VL is the latest generation of Alibaba's Qwen-VL line of vision-language models, with the flagship Qwen3-VL-235B-A22B released on September 23, 2025 under an Apache 2.0 license. It is a Mixture-of-Experts model with 235B total parameters and roughly 22B active per token, shipped in both an Instruct edition and a reasoning-focused Thinking edition. The full family also includes dense 2B, 4B, 8B, and 32B models plus a smaller 30B-A3B MoE, so the same architecture scales from edge devices to cloud servers.
Qwen3-VL handles text, images, video, and documents/PDFs in a single interleaved context of 256K tokens natively, expandable to 1 million tokens via YaRN. That window lets it read entire books and reason over very long videos; Alibaba reports the model can localize events across hours-long footage thanks to its Text-Timestamp Alignment design. OCR has been expanded to 32 languages (up from 19 in the previous generation), with better handling of low light, blur, tilt, rare characters, and long-document structure.
Beyond perception, Qwen3-VL is built to act. It works as a visual agent that recognizes GUI elements on PC and mobile screens, understands their function, invokes tools, and completes multi-step tasks. Alibaba states the Instruct flagship matches or exceeds Gemini 2.5 Pro on major visual-perception benchmarks, while the Thinking edition targets state-of-the-art multimodal reasoning. Open weights are on Hugging Face, and a hosted API (qwen3-vl-plus / qwen3-vl-flash) is available through Alibaba Cloud Model Studio.
| Released | 2025-09-23 |
|---|---|
| License | Apache 2.0 |
| Weights | Open weights |
| Parameters | 235B total / 22B active (flagship MoE); also dense 2B/4B/8B/32B and MoE 30B-A3B |
| Context | 256K |
| Max output | 32K |
| Architecture | Mixture-of-Experts (flagship 235B-A22B) and dense variants, with three vision upgrades over Qwen2.5-VL: Interleaved-MRoPE (full-frequency positional embeddings over time, width, and height for long video), DeepStack (multi-level ViT feature fusion for fine detail), and Text-Timestamp Alignment for precise event localization in video. |
| Knowledge cutoff | Not publicly disclosed |
| Modalities | Text, Vision, Video, PDF |
| Status | Available |
Benchmarks
- MathVista85.8%
- MathVision74.6%
- DocVQA96.5%
- OCRBench875%
- MMMU-Pro69.3%
- ScreenSpot Pro61.8%
- AndroidWorld63.7%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.20 / 1M tokens (0-32K tier) per 1M tokens |
|---|---|
| Output | $1.60 / 1M tokens (0-32K tier) per 1M tokens |
Hosted qwen3-vl-plus on Alibaba Cloud Model Studio (international); tiered higher for longer context (up to $0.60 input / $4.80 output at 128K-256K). qwen3-vl-flash is cheaper. Open weights can be self-hosted for free under Apache 2.0; third-party hosts such as OpenRouter list the 235B-A22B Instruct model around $0.20 input / $0.88 output per 1M tokens.
Strengths
- Open weights under a permissive Apache 2.0 license, free to download, fine-tune, and self-host
- 256K native context expandable to 1M tokens for long documents and hours-long video
- Strong visual math and reasoning (MathVista 85.8, MathVision 74.6)
- High-accuracy document understanding and OCR across 32 languages (DocVQA 96.5, OCRBench 875)
- Visual-agent capabilities: operates PC and mobile GUIs, recognizes elements, and invokes tools
- Family scales from a 2B edge model to a 235B MoE flagship, with Instruct and Thinking editions
Best for
- Document, invoice, and form parsing with multilingual OCR
- Long-video understanding, search, and timestamp-grounded event localization
- GUI automation and visual agents that operate apps on PC and mobile
- Multimodal STEM and visual-math problem solving
- Chart, diagram, and screenshot question answering
- Self-hosted multimodal applications where open weights and data control matter
How to access
| Provider | Model ID |
|---|---|
| Alibaba Cloud Model Studio ↗ | qwen3-vl-plus |
| OpenRouter ↗ | qwen/qwen3-vl-235b-a22b-instruct |
Qwen-VL — every version
The full lineage of the Qwen-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Qwen3-VLcurrent | 2025-09-23 | — | Apache-2.0 |
| Qwen2.5-VL | 2025-01 | — | Open weights |
| Qwen2-VL | 2024-12 | — | Open weights |
FAQ
Is Qwen3-VL open source?
Yes. The Qwen3-VL weights, including the flagship 235B-A22B model and the dense 2B/4B/8B/32B variants, are released under the Apache 2.0 license, so you can download, fine-tune, and self-host them, including for commercial use.
What is the context window of Qwen3-VL?
It supports a 256K-token context window natively, covering interleaved text, images, and video, and can be extended to 1 million tokens using YaRN. That is large enough to process entire books or hours-long videos in a single pass.
How big is the flagship Qwen3-VL model?
The flagship Qwen3-VL-235B-A22B is a Mixture-of-Experts model with 235B total parameters and about 22B active per token. It comes in an Instruct edition and a reasoning-focused Thinking edition; the full release weighs around 471 GB.
Can Qwen3-VL act as an agent or do OCR?
Yes. Qwen3-VL works as a visual agent that recognizes GUI elements on PC and mobile screens and invokes tools to complete tasks, and its OCR now covers 32 languages with strong document-parsing results (DocVQA 96.5, OCRBench 875).