Qwen3

Alibaba's open-weight Qwen3 family with one-model thinking / non-thinking modes, from 0.6B dense to a 235B-A22B MoE flagship

Overview

Qwen3 is the third-generation open-weight large language model family from Alibaba's Qwen team, released on April 28, 2025. Rather than a single model, it is a full lineup: dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B parameters, plus two Mixture-of-Experts models — Qwen3-30B-A3B (30B total, 3B active) and the flagship Qwen3-235B-A22B (235B total, 22B active). Every model is released under the permissive Apache 2.0 license with downloadable weights on Hugging Face, ModelScope, Kaggle, and Ollama.

Qwen3's headline feature is a hybrid design: a single model can operate in a thinking mode that produces a step-by-step reasoning trace before its final answer, or a non-thinking mode that responds directly for low-latency chat. Users (and developers, via a flag or a /think and /no_think control) choose how much the model deliberates per task, so one deployment covers both hard reasoning and fast dialogue without swapping models. Qwen3 was pre-trained on roughly 36 trillion tokens — nearly double Qwen2.5 — spanning 119 languages and dialects, and post-trained with reinforcement learning for math, code, and agentic tool use.

The flagship Qwen3-235B-A22B activates only 22B of its 235B parameters per token, giving it inference cost closer to a mid-size dense model while competing with much larger systems on reasoning and coding benchmarks. The smaller, edge-friendly dense models (0.6B-8B) make Qwen3 practical to run locally on a laptop or single GPU. Hosted API access is available through Alibaba Cloud Model Studio and aggregators such as OpenRouter.

Released	2025-04-28
License	Apache 2.0
Weights	Open weights
Parameters	235B total / 22B active (MoE flagship); dense 0.6B-32B and 30B-A3B MoE also released
Context	32K (131K with YaRN)
Max output	32,768 tokens (38,912 for hard reasoning)
Architecture	Mixture-of-Experts and dense transformers. The flagship Qwen3-235B-A22B is an MoE with 235B total parameters and 22B activated per token, 94 layers, 128 experts (8 routed per token), and grouped-query attention (64 query heads / 4 key-value heads). The line also ships a 30B-A3B MoE (3B active) and dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B. Every model runs a single set of weights in two modes: a thinking mode that emits a reasoning trace before answering, and a non-thinking mode for fast direct replies. Native context is 32,768 tokens, extendable to 131,072 (128K) via YaRN. Pre-trained on roughly 36 trillion tokens across 119 languages and dialects.
Modalities	Text
Status	Available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.455 / 1M tokens per 1M tokens
Output	$1.82 / 1M tokens per 1M tokens

Rates shown for the flagship Qwen3-235B-A22B via OpenRouter (a 35%-off promotional rate at time of capture). The open weights are free to self-host; smaller dense models are far cheaper or free to run locally.

Pricing source ↗

Strengths

Fully open weights under Apache 2.0 — commercial-friendly, self-hostable, and downloadable from Hugging Face, ModelScope, Kaggle, and Ollama
One model, two modes: switch between a deliberate thinking mode and a fast non-thinking mode without changing models
Wide size range, from a 0.6B dense model that runs on-device to a 235B-A22B MoE flagship for frontier-level reasoning
Efficient MoE flagship — only 22B of 235B parameters activate per token, keeping inference cost low relative to total scale
Strong math and coding results (AIME, LiveCodeBench) and high Arena-Hard alignment scores among open models
Broad multilingual coverage: pre-trained across 119 languages and dialects

Best for

Self-hosted or private-cloud assistants where open weights and Apache 2.0 licensing are required
On-device and edge deployment using the small dense models (0.6B-8B) on laptops or single GPUs
Math, coding, and STEM problem-solving that benefits from the thinking (reasoning) mode
Cost-sensitive, high-volume chat served in non-thinking mode for low latency
Agentic and tool-using workflows that call functions across multiple turns
Multilingual applications that need coverage well beyond English and Chinese

How to access

Provider	Model ID
Alibaba Cloud Model Studio ↗	`qwen3-235b-a22b`
OpenRouter ↗	`qwen/qwen3-235b-a22b`
Ollama ↗	`qwen3`

Qwen (open-weight) — every version

The full lineage of the Qwen (open-weight) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
Qwen3.6current	2026-04	—	Apache-2.0
Qwen3.5	2026-02-16	—	Apache-2.0
Qwen3 (2507 update)	2025-07	—	Apache-2.0
Qwen3	2025-04-28	—	Apache-2.0
Qwen2.5	2024-09	—	Apache-2.0
Qwen2	2024-06	—	Apache-2.0

FAQ

Is Qwen3 open source and free to use?

Yes. All Qwen3 open-weight models are released under the permissive Apache 2.0 license, and the weights are freely downloadable from Hugging Face, ModelScope, Kaggle, and Ollama. You can self-host them for commercial use, or call a hosted endpoint such as Alibaba Cloud Model Studio or OpenRouter.

What are Qwen3's thinking and non-thinking modes?

Qwen3 runs a single set of weights in two modes. Thinking mode produces a step-by-step reasoning trace before its final answer, which helps on hard math, coding, and logic tasks. Non-thinking mode answers directly for faster, cheaper replies. You can switch per request with a flag or with /think and /no_think controls, so one model covers both deliberate reasoning and quick chat.

What sizes does Qwen3 come in?

Qwen3 ships dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B parameters, plus two Mixture-of-Experts models: Qwen3-30B-A3B (30B total, 3B active) and the flagship Qwen3-235B-A22B (235B total, 22B active). The small dense models run on a laptop or single GPU, while the MoE flagship targets frontier-level reasoning at lower inference cost than its total size suggests.

How much context can Qwen3 handle?

Qwen3 models support 32,768 tokens of context natively, and the larger models extend to 131,072 tokens (128K) using YaRN scaling. Recommended generation length is up to 32,768 tokens, or 38,912 for complex reasoning problems.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// Qwen (open-weight) — every version

// FAQ