QwQ-32B

Alibaba's 32B open reasoning model, RL-trained to rival DeepSeek-R1 on math and code — small enough to run on one GPU. Apache-2.0.

Overview

QwQ-32B is the flagship release in Alibaba's QwQ reasoning line, open-sourced by the Qwen team in early March 2025. It is a 32.5-billion-parameter dense transformer (31.0B non-embedding) built on the Qwen2.5-32B base and turned into a chain-of-thought 'thinking' model through large-scale reinforcement learning. The whole model ships as open weights under the permissive Apache-2.0 license on Hugging Face and ModelScope, and is small enough to run on a single high-end consumer GPU.

The headline claim is efficiency: despite having only 32B dense parameters, QwQ-32B reaches performance roughly comparable to DeepSeek-R1 — a 671B-parameter Mixture-of-Experts model with about 37B active per token — and outperforms OpenAI's o1-mini on the benchmarks Qwen reported. Qwen credits a two-stage, outcome-based RL pipeline: a first stage that rewards correct answers on math and code (verified by an accuracy checker and a code-execution server), then a second stage that adds instruction-following, tool use, and human-preference alignment without eroding the reasoning gains.

QwQ-32B is text-only with a 131,072-token (131K) context window, using YaRN to extend beyond the base 32K. It marks the production successor to the late-2024 QwQ-32B-Preview and the QVQ-72B-Preview visual-reasoning experiment, and represents the standalone phase of Qwen's reasoning work before 'thinking' folded into the unified Qwen3 family. You can try it on Qwen Chat or call it via Alibaba Cloud's DashScope (model id qwq-32b) and third-party hosts such as OpenRouter.

Released	2025-03-05
License	Apache-2.0
Weights	Open weights
Parameters	32.5B total (31.0B non-embedding)
Context	131K
Max output	Not separately specified (131,072-token total context; long reasoning traces consume output budget)
Architecture	Dense causal-LM transformer with 64 layers and grouped-query attention (40 query heads, 8 key/value heads), using RoPE, SwiGLU, RMSNorm, and attention QKV bias. Built on the Qwen2.5-32B base and post-trained with a two-stage, outcome-rewarded reinforcement-learning scaling approach: a first RL stage for math and coding (accuracy verifier plus code-execution checks) followed by a second stage adding general instruction-following, tool use, and alignment. Native context is 131,072 tokens; inputs beyond ~32K tokens use YaRN length extrapolation.
Knowledge cutoff	Not officially disclosed
Modalities	Text
Status	Generally available (open weights)

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.66 / 1M tokens per 1M tokens
Output	$1.00 / 1M tokens per 1M tokens

Representative hosted rate tracked by Artificial Analysis (blended ~$0.69/1M). The weights are open (Apache-2.0), so self-hosting is free aside from compute; per-token prices vary by provider on Alibaba Cloud DashScope, OpenRouter, and other hosts.

Pricing source ↗

Strengths

Open weights under the permissive Apache-2.0 license — free for commercial use, self-hosting, fine-tuning, and distillation
Strong reasoning at a small footprint: a 32B dense model reaching scores roughly comparable to the 671B DeepSeek-R1 and beating OpenAI o1-mini on Qwen's reported benchmarks
Competition-grade math and coding via RL scaling — AIME24 79.5, LiveCodeBench 63.4
Leads its comparison set on general reasoning (LiveBench 73.1) and tool/function calling (BFCL 66.4)
Runs locally on a single high-end consumer GPU, unlike the much larger MoE reasoning models it competes with
131K-token context window (with YaRN extension) for long problems and documents

Best for

Competition-style mathematics and multi-step logical reasoning
Coding and algorithmic problem-solving (LiveCodeBench-style tasks)
Agentic tool use and function-calling workflows
Self-hosted reasoning deployments where an open, Apache-2.0-licensed model is required
Running a capable reasoning model locally on a single consumer GPU
Research and distillation: using QwQ-32B's chain-of-thought to study or train smaller reasoning models

How to access

Provider	Model ID
Alibaba Cloud Model Studio (DashScope) ↗	`qwq-32b`
OpenRouter ↗	`qwen/qwq-32b`

QwQ / QVQ (reasoning preview) — every version

The full lineage of the QwQ / QVQ (reasoning preview) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
QwQ-32Bcurrent	2025-03-05	—	Apache-2.0
QVQ-72B-Preview	2024-12	—	Open weights

FAQ

How can a 32B model like QwQ-32B compete with DeepSeek-R1?

QwQ-32B is a dense 32.5B-parameter model built on the Qwen2.5-32B base and post-trained with large-scale reinforcement learning that rewards correct answers on math and code. Qwen reports it reaches performance roughly comparable to DeepSeek-R1 — a 671B Mixture-of-Experts model with about 37B active parameters — and beats OpenAI o1-mini on benchmarks such as AIME24 (79.5), LiveCodeBench (63.4), and LiveBench (73.1). The point of the release is that RL scaling on a strong base can rival far larger reasoning models.

Is QwQ-32B open source and free to use?

Yes. The weights are released under the Apache-2.0 license on Hugging Face and ModelScope, so you can download, self-host, fine-tune, distill, and use them commercially for free. You also pay per token only if you use a hosted endpoint such as Alibaba Cloud DashScope (model id qwq-32b) or OpenRouter.

What is QwQ-32B's context window and parameter count?

It is a dense causal-LM transformer with 32.5 billion total parameters (31.0B non-embedding), 64 layers, and grouped-query attention (40 query heads, 8 key/value heads). Its native context window is 131,072 tokens (about 131K); inputs beyond roughly 32K tokens use YaRN length extrapolation. The model is text-only.

Can I run QwQ-32B locally?

Yes. At 32B dense parameters it is small enough to run on a single high-end consumer GPU (especially with quantization), which is a key selling point versus the much larger Mixture-of-Experts reasoning models it competes with. The Apache-2.0 weights are on Hugging Face and ModelScope, and it works with common local runtimes such as vLLM and Ollama.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// QwQ / QVQ (reasoning preview) — every version

// FAQ