QVQ-72B-Preview

Alibaba's experimental open-weight visual-reasoning model: built on Qwen2-VL-72B, it 'thinks out loud' step by step over images. Research preview, December 2024.

Overview

QVQ-72B-Preview is an experimental open-weight visual-reasoning model released by Alibaba's Qwen team on 25 December 2024. It is built on the Qwen2-VL-72B vision-language model and adds a chain-of-thought 'thinking' stage: given an image and a question, QVQ streams a long, step-by-step reasoning trace before settling on an answer — the visual counterpart to a text reasoning model like OpenAI o1. Qwen positioned it as a research preview to explore reasoning over non-textual (visual) inputs, and shipped the weights publicly on Hugging Face and GitHub.

On Qwen's reported benchmarks, QVQ-72B-Preview scores 70.3% on MMMU (val), a clear lift over its Qwen2-VL-72B base, with strong math-and-vision results on MathVista (71.4%), MathVision (35.9%), and OlympiadBench (20.4%). Qwen framed it as narrowing the gap with the leading o1 model on multimodal reasoning. It was first published with an Apache-2.0 tag that the team quickly corrected to the Qwen License; the weights remain openly downloadable with commercial use permitted.

QVQ-72B-Preview was explicitly experimental. The model card warns of language mixing and code-switching, recursive reasoning loops that produce verbose output, and a tendency to lose focus on the image during long reasoning and hallucinate. It supports only single-round dialogue and does not accept video, and Qwen said it could not fully replace Qwen2-VL-72B-Instruct. It sits in Qwen's QwQ/QVQ reasoning-preview line alongside QwQ-32B-Preview; that work later matured into QwQ-32B (March 2025) and was ultimately merged into the unified Qwen3 'thinking' family, so QVQ-72B-Preview is best understood as a historical milestone rather than a model to deploy today.

Released	2024-12-25
License	Qwen License (license: other, license_name: qwen) — open weights, commercial use permitted
Weights	Open weights
Parameters	~73B (built on the Qwen2-VL-72B base; the name uses the 72B base size, aggregators report ~73B with the vision encoder)
Context	~32K (inherited from the Qwen2-VL-72B base; QVQ's own card does not state a separate figure)
Max output	Not separately specified — the model card's example uses max_new_tokens=8192; long reasoning traces consume the output budget
Architecture	Vision-language transformer built on Qwen2-VL-72B: a dense Qwen2-72B language backbone paired with the Qwen2-VL vision encoder, post-trained to produce long chain-of-thought ('thinking') reasoning over images. It is an experimental preview, supports single-round dialogue only, takes image + text input (no video), and emits a long step-by-step reasoning trace before its answer.
Knowledge cutoff	Not officially disclosed (inherits the Qwen2-VL-72B base)
Modalities	Text, Vision
Status	Research preview (experimental) — superseded by QwQ-32B and later folded into Qwen3 'thinking' models; not maintained as a production model

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

First widely available open-weight VISUAL reasoning model — 'thinks out loud' step by step over an image, not just over text
Strong multimodal-reasoning benchmarks for late 2024: MMMU 70.3% (near Claude 3.5 Sonnet) and MathVista 71.4% (ahead of the reported GPT-4o and Claude 3.5 Sonnet figures)
Clear gains over its own Qwen2-VL-72B base on math-and-vision tasks (e.g. MathVision 35.9% vs 25.9%, OlympiadBench 20.4% vs 11.2%)
Open weights under the Qwen License with commercial use permitted — downloadable on Hugging Face and GitHub for self-hosting and research
Transparent, inspectable chain-of-thought makes its visual reasoning easy to study and debug

Best for

Research into multimodal / visual chain-of-thought reasoning
Step-by-step reasoning over diagrams, charts, math problems, and figures in images
Visual math and science problem-solving (MathVista / MathVision / OlympiadBench-style tasks)
Self-hosted experimentation with an open-weight visual reasoning model
Baseline or teacher for studying and distilling visual reasoning traces

How to access

Provider	Model ID
Hugging Face (download weights / Transformers) ↗	`Qwen/QVQ-72B-Preview`

QwQ / QVQ (reasoning preview) — every version

The full lineage of the QwQ / QVQ (reasoning preview) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
QwQ-32Bcurrent	2025-03-05	—	Apache-2.0
QVQ-72B-Preview	2024-12	—	Open weights

FAQ

What is QVQ-72B-Preview?

QVQ-72B-Preview is an experimental open-weight visual-reasoning model released by Alibaba's Qwen team on 25 December 2024. Built on Qwen2-VL-72B, it adds a chain-of-thought 'thinking' stage: given an image and a question, it streams a long step-by-step reasoning trace before answering — effectively a visual counterpart to text reasoning models like OpenAI o1. It was published as a research preview, not a production model.

How well does QVQ-72B-Preview perform on benchmarks?

On the numbers Qwen reported on the model card, QVQ-72B-Preview scores 70.3% on MMMU (val), 71.4% on MathVista (mini), 35.9% on MathVision (full), and 20.4% on OlympiadBench. These are clear gains over its Qwen2-VL-72B base on visual-math tasks, and Qwen described the model as narrowing the gap with the leading o1 model on multimodal reasoning.

Is QVQ-72B-Preview open source, and what is its license?

The weights are open and downloadable on Hugging Face and GitHub. The license is the Qwen License (tagged license: other, license_name: qwen) and permits commercial use. It was briefly published with an Apache-2.0 tag at launch, which the Qwen team quickly corrected to the Qwen License.

Should I use QVQ-72B-Preview today, or something newer?

QVQ-72B-Preview is a historical research preview with known limitations — language mixing, recursive reasoning loops, single-round-only dialogue, no video input, and a tendency to lose track of the image and hallucinate during long reasoning. Qwen's reasoning work has since moved on through QwQ-32B (March 2025) and into the unified Qwen3 'thinking' family, so for production visual reasoning a current Qwen (or other up-to-date) model is the better choice.

// Overview

// Benchmarks

// Strengths

// Best for

// How to access

// QwQ / QVQ (reasoning preview) — every version

// FAQ