Overview
QwQ-32B is the flagship release in Alibaba's QwQ reasoning line, open-sourced by the Qwen team in early March 2025. It is a 32.5-billion-parameter dense transformer (31.0B non-embedding) built on the Qwen2.5-32B base and turned into a chain-of-thought 'thinking' model through large-scale reinforcement learning. The whole model ships as open weights under the permissive Apache-2.0 license on Hugging Face and ModelScope, and is small enough to run on a single high-end consumer GPU.
The headline claim is efficiency: despite having only 32B dense parameters, QwQ-32B reaches performance roughly comparable to DeepSeek-R1 — a 671B-parameter Mixture-of-Experts model with about 37B active per token — and outperforms OpenAI's o1-mini on the benchmarks Qwen reported. Qwen credits a two-stage, outcome-based RL pipeline: a first stage that rewards correct answers on math and code (verified by an accuracy checker and a code-execution server), then a second stage that adds instruction-following, tool use, and human-preference alignment without eroding the reasoning gains.
QwQ-32B is text-only with a 131,072-token (131K) context window, using YaRN to extend beyond the base 32K. It marks the production successor to the late-2024 QwQ-32B-Preview and the QVQ-72B-Preview visual-reasoning experiment, and represents the standalone phase of Qwen's reasoning work before 'thinking' folded into the unified Qwen3 family. You can try it on Qwen Chat or call it via Alibaba Cloud's DashScope (model id qwq-32b) and third-party hosts such as OpenRouter.
| Released | 2025-03-05 |
|---|---|
| License | Apache-2.0 |
| Weights | Open weights |
| Parameters | 32.5B total (31.0B non-embedding) |
| Context | 131K |
| Max output | Not separately specified (131,072-token total context; long reasoning traces consume output budget) |
| Architecture | Dense causal-LM transformer with 64 layers and grouped-query attention (40 query heads, 8 key/value heads), using RoPE, SwiGLU, RMSNorm, and attention QKV bias. Built on the Qwen2.5-32B base and post-trained with a two-stage, outcome-rewarded reinforcement-learning scaling approach: a first RL stage for math and coding (accuracy verifier plus code-execution checks) followed by a second stage adding general instruction-following, tool use, and alignment. Native context is 131,072 tokens; inputs beyond ~32K tokens use YaRN length extrapolation. |
| Knowledge cutoff | Not officially disclosed |
| Modalities | Text |
| Status | Generally available (open weights) |
Benchmarks
- AIME2479.5%
- LiveCodeBench63.4%
- LiveBench73.1%
- IFEval83.9%
- BFCL (function calling)66.4%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.66 / 1M tokens per 1M tokens |
|---|---|
| Output | $1.00 / 1M tokens per 1M tokens |
Representative hosted rate tracked by Artificial Analysis (blended ~$0.69/1M). The weights are open (Apache-2.0), so self-hosting is free aside from compute; per-token prices vary by provider on Alibaba Cloud DashScope, OpenRouter, and other hosts.
Strengths
- Open weights under the permissive Apache-2.0 license — free for commercial use, self-hosting, fine-tuning, and distillation
- Strong reasoning at a small footprint: a 32B dense model reaching scores roughly comparable to the 671B DeepSeek-R1 and beating OpenAI o1-mini on Qwen's reported benchmarks
- Competition-grade math and coding via RL scaling — AIME24 79.5, LiveCodeBench 63.4
- Leads its comparison set on general reasoning (LiveBench 73.1) and tool/function calling (BFCL 66.4)
- Runs locally on a single high-end consumer GPU, unlike the much larger MoE reasoning models it competes with
- 131K-token context window (with YaRN extension) for long problems and documents
Best for
- Competition-style mathematics and multi-step logical reasoning
- Coding and algorithmic problem-solving (LiveCodeBench-style tasks)
- Agentic tool use and function-calling workflows
- Self-hosted reasoning deployments where an open, Apache-2.0-licensed model is required
- Running a capable reasoning model locally on a single consumer GPU
- Research and distillation: using QwQ-32B's chain-of-thought to study or train smaller reasoning models
How to access
| Provider | Model ID |
|---|---|
| Alibaba Cloud Model Studio (DashScope) ↗ | qwq-32b |
| OpenRouter ↗ | qwen/qwq-32b |
QwQ / QVQ (reasoning preview) — every version
The full lineage of the QwQ / QVQ (reasoning preview) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| QwQ-32Bcurrent | 2025-03-05 | — | Apache-2.0 |
| QVQ-72B-Preview | 2024-12 | — | Open weights |
FAQ
How can a 32B model like QwQ-32B compete with DeepSeek-R1?
QwQ-32B is a dense 32.5B-parameter model built on the Qwen2.5-32B base and post-trained with large-scale reinforcement learning that rewards correct answers on math and code. Qwen reports it reaches performance roughly comparable to DeepSeek-R1 — a 671B Mixture-of-Experts model with about 37B active parameters — and beats OpenAI o1-mini on benchmarks such as AIME24 (79.5), LiveCodeBench (63.4), and LiveBench (73.1). The point of the release is that RL scaling on a strong base can rival far larger reasoning models.
Is QwQ-32B open source and free to use?
Yes. The weights are released under the Apache-2.0 license on Hugging Face and ModelScope, so you can download, self-host, fine-tune, distill, and use them commercially for free. You also pay per token only if you use a hosted endpoint such as Alibaba Cloud DashScope (model id qwq-32b) or OpenRouter.
What is QwQ-32B's context window and parameter count?
It is a dense causal-LM transformer with 32.5 billion total parameters (31.0B non-embedding), 64 layers, and grouped-query attention (40 query heads, 8 key/value heads). Its native context window is 131,072 tokens (about 131K); inputs beyond roughly 32K tokens use YaRN length extrapolation. The model is text-only.
Can I run QwQ-32B locally?
Yes. At 32B dense parameters it is small enough to run on a single high-end consumer GPU (especially with quantization), which is a key selling point versus the much larger Mixture-of-Experts reasoning models it competes with. The Apache-2.0 weights are on Hugging Face and ModelScope, and it works with common local runtimes such as vLLM and Ollama.