Overview
Qwen3 is the third-generation open-weight large language model family from Alibaba's Qwen team, released on April 28, 2025. Rather than a single model, it is a full lineup: dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B parameters, plus two Mixture-of-Experts models — Qwen3-30B-A3B (30B total, 3B active) and the flagship Qwen3-235B-A22B (235B total, 22B active). Every model is released under the permissive Apache 2.0 license with downloadable weights on Hugging Face, ModelScope, Kaggle, and Ollama.
Qwen3's headline feature is a hybrid design: a single model can operate in a thinking mode that produces a step-by-step reasoning trace before its final answer, or a non-thinking mode that responds directly for low-latency chat. Users (and developers, via a flag or a /think and /no_think control) choose how much the model deliberates per task, so one deployment covers both hard reasoning and fast dialogue without swapping models. Qwen3 was pre-trained on roughly 36 trillion tokens — nearly double Qwen2.5 — spanning 119 languages and dialects, and post-trained with reinforcement learning for math, code, and agentic tool use.
The flagship Qwen3-235B-A22B activates only 22B of its 235B parameters per token, giving it inference cost closer to a mid-size dense model while competing with much larger systems on reasoning and coding benchmarks. The smaller, edge-friendly dense models (0.6B-8B) make Qwen3 practical to run locally on a laptop or single GPU. Hosted API access is available through Alibaba Cloud Model Studio and aggregators such as OpenRouter.
| Released | 2025-04-28 |
|---|---|
| License | Apache 2.0 |
| Weights | Open weights |
| Parameters | 235B total / 22B active (MoE flagship); dense 0.6B-32B and 30B-A3B MoE also released |
| Context | 32K (131K with YaRN) |
| Max output | 32,768 tokens (38,912 for hard reasoning) |
| Architecture | Mixture-of-Experts and dense transformers. The flagship Qwen3-235B-A22B is an MoE with 235B total parameters and 22B activated per token, 94 layers, 128 experts (8 routed per token), and grouped-query attention (64 query heads / 4 key-value heads). The line also ships a 30B-A3B MoE (3B active) and dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B. Every model runs a single set of weights in two modes: a thinking mode that emits a reasoning trace before answering, and a non-thinking mode for fast direct replies. Native context is 32,768 tokens, extendable to 131,072 (128K) via YaRN. Pre-trained on roughly 36 trillion tokens across 119 languages and dialects. |
| Modalities | Text |
| Status | Available |
Benchmarks
- AIME 2024 (math, thinking mode)85.7%
- AIME 2025 (math, thinking mode)81.5%
- LiveCodeBench v5 (coding)70.7%
- Arena-Hard (alignment)95.6%
- BFCL v3 (function calling)70.8%
- MMLU-Pro (knowledge)68.18%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.455 / 1M tokens per 1M tokens |
|---|---|
| Output | $1.82 / 1M tokens per 1M tokens |
Rates shown for the flagship Qwen3-235B-A22B via OpenRouter (a 35%-off promotional rate at time of capture). The open weights are free to self-host; smaller dense models are far cheaper or free to run locally.
Strengths
- Fully open weights under Apache 2.0 — commercial-friendly, self-hostable, and downloadable from Hugging Face, ModelScope, Kaggle, and Ollama
- One model, two modes: switch between a deliberate thinking mode and a fast non-thinking mode without changing models
- Wide size range, from a 0.6B dense model that runs on-device to a 235B-A22B MoE flagship for frontier-level reasoning
- Efficient MoE flagship — only 22B of 235B parameters activate per token, keeping inference cost low relative to total scale
- Strong math and coding results (AIME, LiveCodeBench) and high Arena-Hard alignment scores among open models
- Broad multilingual coverage: pre-trained across 119 languages and dialects
Best for
- Self-hosted or private-cloud assistants where open weights and Apache 2.0 licensing are required
- On-device and edge deployment using the small dense models (0.6B-8B) on laptops or single GPUs
- Math, coding, and STEM problem-solving that benefits from the thinking (reasoning) mode
- Cost-sensitive, high-volume chat served in non-thinking mode for low latency
- Agentic and tool-using workflows that call functions across multiple turns
- Multilingual applications that need coverage well beyond English and Chinese
How to access
| Provider | Model ID |
|---|---|
| Alibaba Cloud Model Studio ↗ | qwen3-235b-a22b |
| OpenRouter ↗ | qwen/qwen3-235b-a22b |
| Ollama ↗ | qwen3 |
Qwen (open-weight) — every version
The full lineage of the Qwen (open-weight) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Qwen3.6current | 2026-04 | — | Apache-2.0 |
| Qwen3.5 | 2026-02-16 | — | Apache-2.0 |
| Qwen3 (2507 update) | 2025-07 | — | Apache-2.0 |
| Qwen3 | 2025-04-28 | — | Apache-2.0 |
| Qwen2.5 | 2024-09 | — | Apache-2.0 |
| Qwen2 | 2024-06 | — | Apache-2.0 |
FAQ
Is Qwen3 open source and free to use?
Yes. All Qwen3 open-weight models are released under the permissive Apache 2.0 license, and the weights are freely downloadable from Hugging Face, ModelScope, Kaggle, and Ollama. You can self-host them for commercial use, or call a hosted endpoint such as Alibaba Cloud Model Studio or OpenRouter.
What are Qwen3's thinking and non-thinking modes?
Qwen3 runs a single set of weights in two modes. Thinking mode produces a step-by-step reasoning trace before its final answer, which helps on hard math, coding, and logic tasks. Non-thinking mode answers directly for faster, cheaper replies. You can switch per request with a flag or with /think and /no_think controls, so one model covers both deliberate reasoning and quick chat.
What sizes does Qwen3 come in?
Qwen3 ships dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B parameters, plus two Mixture-of-Experts models: Qwen3-30B-A3B (30B total, 3B active) and the flagship Qwen3-235B-A22B (235B total, 22B active). The small dense models run on a laptop or single GPU, while the MoE flagship targets frontier-level reasoning at lower inference cost than its total size suggests.
How much context can Qwen3 handle?
Qwen3 models support 32,768 tokens of context natively, and the larger models extend to 131,072 tokens (128K) using YaRN scaling. Recommended generation length is up to 32,768 tokens, or 38,912 for complex reasoning problems.