AI/TLDR

QVQ-72B-Preview

Alibaba's experimental open-weight visual-reasoning model: built on Qwen2-VL-72B, it 'thinks out loud' step by step over images. Research preview, December 2024.

Overview

QVQ-72B-Preview is an experimental open-weight visual-reasoning model released by Alibaba's Qwen team on 25 December 2024. It is built on the Qwen2-VL-72B vision-language model and adds a chain-of-thought 'thinking' stage: given an image and a question, QVQ streams a long, step-by-step reasoning trace before settling on an answer — the visual counterpart to a text reasoning model like OpenAI o1. Qwen positioned it as a research preview to explore reasoning over non-textual (visual) inputs, and shipped the weights publicly on Hugging Face and GitHub.

On Qwen's reported benchmarks, QVQ-72B-Preview scores 70.3% on MMMU (val), a clear lift over its Qwen2-VL-72B base, with strong math-and-vision results on MathVista (71.4%), MathVision (35.9%), and OlympiadBench (20.4%). Qwen framed it as narrowing the gap with the leading o1 model on multimodal reasoning. It was first published with an Apache-2.0 tag that the team quickly corrected to the Qwen License; the weights remain openly downloadable with commercial use permitted.

QVQ-72B-Preview was explicitly experimental. The model card warns of language mixing and code-switching, recursive reasoning loops that produce verbose output, and a tendency to lose focus on the image during long reasoning and hallucinate. It supports only single-round dialogue and does not accept video, and Qwen said it could not fully replace Qwen2-VL-72B-Instruct. It sits in Qwen's QwQ/QVQ reasoning-preview line alongside QwQ-32B-Preview; that work later matured into QwQ-32B (March 2025) and was ultimately merged into the unified Qwen3 'thinking' family, so QVQ-72B-Preview is best understood as a historical milestone rather than a model to deploy today.

Released2024-12-25
LicenseQwen License (license: other, license_name: qwen) — open weights, commercial use permitted
WeightsOpen weights
Parameters~73B (built on the Qwen2-VL-72B base; the name uses the 72B base size, aggregators report ~73B with the vision encoder)
Context~32K (inherited from the Qwen2-VL-72B base; QVQ's own card does not state a separate figure)
Max outputNot separately specified — the model card's example uses max_new_tokens=8192; long reasoning traces consume the output budget
ArchitectureVision-language transformer built on Qwen2-VL-72B: a dense Qwen2-72B language backbone paired with the Qwen2-VL vision encoder, post-trained to produce long chain-of-thought ('thinking') reasoning over images. It is an experimental preview, supports single-round dialogue only, takes image + text input (no video), and emits a long step-by-step reasoning trace before its answer.
Knowledge cutoffNot officially disclosed (inherits the Qwen2-VL-72B base)
ModalitiesText, Vision
StatusResearch preview (experimental) — superseded by QwQ-32B and later folded into Qwen3 'thinking' models; not maintained as a production model

Benchmarks

  1. MMMU (val)70.3%
  2. MathVista (mini)71.4%
  3. MathVision (full)35.9%
  4. OlympiadBench20.4%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

  • First widely available open-weight VISUAL reasoning model — 'thinks out loud' step by step over an image, not just over text
  • Strong multimodal-reasoning benchmarks for late 2024: MMMU 70.3% (near Claude 3.5 Sonnet) and MathVista 71.4% (ahead of the reported GPT-4o and Claude 3.5 Sonnet figures)
  • Clear gains over its own Qwen2-VL-72B base on math-and-vision tasks (e.g. MathVision 35.9% vs 25.9%, OlympiadBench 20.4% vs 11.2%)
  • Open weights under the Qwen License with commercial use permitted — downloadable on Hugging Face and GitHub for self-hosting and research
  • Transparent, inspectable chain-of-thought makes its visual reasoning easy to study and debug

Best for

  • Research into multimodal / visual chain-of-thought reasoning
  • Step-by-step reasoning over diagrams, charts, math problems, and figures in images
  • Visual math and science problem-solving (MathVista / MathVision / OlympiadBench-style tasks)
  • Self-hosted experimentation with an open-weight visual reasoning model
  • Baseline or teacher for studying and distilling visual reasoning traces

How to access

ProviderModel ID
Hugging Face (download weights / Transformers) ↗Qwen/QVQ-72B-Preview

QwQ / QVQ (reasoning preview) — every version

The full lineage of the QwQ / QVQ (reasoning preview) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
QwQ-32Bcurrent2025-03-05Apache-2.0
QVQ-72B-Preview2024-12Open weights

FAQ

What is QVQ-72B-Preview?

QVQ-72B-Preview is an experimental open-weight visual-reasoning model released by Alibaba's Qwen team on 25 December 2024. Built on Qwen2-VL-72B, it adds a chain-of-thought 'thinking' stage: given an image and a question, it streams a long step-by-step reasoning trace before answering — effectively a visual counterpart to text reasoning models like OpenAI o1. It was published as a research preview, not a production model.

How well does QVQ-72B-Preview perform on benchmarks?

On the numbers Qwen reported on the model card, QVQ-72B-Preview scores 70.3% on MMMU (val), 71.4% on MathVista (mini), 35.9% on MathVision (full), and 20.4% on OlympiadBench. These are clear gains over its Qwen2-VL-72B base on visual-math tasks, and Qwen described the model as narrowing the gap with the leading o1 model on multimodal reasoning.

Is QVQ-72B-Preview open source, and what is its license?

The weights are open and downloadable on Hugging Face and GitHub. The license is the Qwen License (tagged license: other, license_name: qwen) and permits commercial use. It was briefly published with an Apache-2.0 tag at launch, which the Qwen team quickly corrected to the Qwen License.

Should I use QVQ-72B-Preview today, or something newer?

QVQ-72B-Preview is a historical research preview with known limitations — language mixing, recursive reasoning loops, single-round-only dialogue, no video input, and a tendency to lose track of the image and hallucinate during long reasoning. Qwen's reasoning work has since moved on through QwQ-32B (March 2025) and into the unified Qwen3 'thinking' family, so for production visual reasoning a current Qwen (or other up-to-date) model is the better choice.