AI/TLDR

Qwen3

Alibaba's open-weight Qwen3 family with one-model thinking / non-thinking modes, from 0.6B dense to a 235B-A22B MoE flagship

Overview

Qwen3 is the third-generation open-weight large language model family from Alibaba's Qwen team, released on April 28, 2025. Rather than a single model, it is a full lineup: dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B parameters, plus two Mixture-of-Experts models — Qwen3-30B-A3B (30B total, 3B active) and the flagship Qwen3-235B-A22B (235B total, 22B active). Every model is released under the permissive Apache 2.0 license with downloadable weights on Hugging Face, ModelScope, Kaggle, and Ollama.

Qwen3's headline feature is a hybrid design: a single model can operate in a thinking mode that produces a step-by-step reasoning trace before its final answer, or a non-thinking mode that responds directly for low-latency chat. Users (and developers, via a flag or a /think and /no_think control) choose how much the model deliberates per task, so one deployment covers both hard reasoning and fast dialogue without swapping models. Qwen3 was pre-trained on roughly 36 trillion tokens — nearly double Qwen2.5 — spanning 119 languages and dialects, and post-trained with reinforcement learning for math, code, and agentic tool use.

The flagship Qwen3-235B-A22B activates only 22B of its 235B parameters per token, giving it inference cost closer to a mid-size dense model while competing with much larger systems on reasoning and coding benchmarks. The smaller, edge-friendly dense models (0.6B-8B) make Qwen3 practical to run locally on a laptop or single GPU. Hosted API access is available through Alibaba Cloud Model Studio and aggregators such as OpenRouter.

Released2025-04-28
LicenseApache 2.0
WeightsOpen weights
Parameters235B total / 22B active (MoE flagship); dense 0.6B-32B and 30B-A3B MoE also released
Context32K (131K with YaRN)
Max output32,768 tokens (38,912 for hard reasoning)
ArchitectureMixture-of-Experts and dense transformers. The flagship Qwen3-235B-A22B is an MoE with 235B total parameters and 22B activated per token, 94 layers, 128 experts (8 routed per token), and grouped-query attention (64 query heads / 4 key-value heads). The line also ships a 30B-A3B MoE (3B active) and dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B. Every model runs a single set of weights in two modes: a thinking mode that emits a reasoning trace before answering, and a non-thinking mode for fast direct replies. Native context is 32,768 tokens, extendable to 131,072 (128K) via YaRN. Pre-trained on roughly 36 trillion tokens across 119 languages and dialects.
ModalitiesText
StatusAvailable

Benchmarks

  1. AIME 2024 (math, thinking mode)85.7%
  2. AIME 2025 (math, thinking mode)81.5%
  3. LiveCodeBench v5 (coding)70.7%
  4. Arena-Hard (alignment)95.6%
  5. BFCL v3 (function calling)70.8%
  6. MMLU-Pro (knowledge)68.18%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.455 / 1M tokens per 1M tokens
Output$1.82 / 1M tokens per 1M tokens

Rates shown for the flagship Qwen3-235B-A22B via OpenRouter (a 35%-off promotional rate at time of capture). The open weights are free to self-host; smaller dense models are far cheaper or free to run locally.

Pricing source ↗

Strengths

  • Fully open weights under Apache 2.0 — commercial-friendly, self-hostable, and downloadable from Hugging Face, ModelScope, Kaggle, and Ollama
  • One model, two modes: switch between a deliberate thinking mode and a fast non-thinking mode without changing models
  • Wide size range, from a 0.6B dense model that runs on-device to a 235B-A22B MoE flagship for frontier-level reasoning
  • Efficient MoE flagship — only 22B of 235B parameters activate per token, keeping inference cost low relative to total scale
  • Strong math and coding results (AIME, LiveCodeBench) and high Arena-Hard alignment scores among open models
  • Broad multilingual coverage: pre-trained across 119 languages and dialects

Best for

  • Self-hosted or private-cloud assistants where open weights and Apache 2.0 licensing are required
  • On-device and edge deployment using the small dense models (0.6B-8B) on laptops or single GPUs
  • Math, coding, and STEM problem-solving that benefits from the thinking (reasoning) mode
  • Cost-sensitive, high-volume chat served in non-thinking mode for low latency
  • Agentic and tool-using workflows that call functions across multiple turns
  • Multilingual applications that need coverage well beyond English and Chinese

How to access

ProviderModel ID
Alibaba Cloud Model Studio ↗qwen3-235b-a22b
OpenRouter ↗qwen/qwen3-235b-a22b
Ollama ↗qwen3

Qwen (open-weight) — every version

The full lineage of the Qwen (open-weight) line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
Qwen3.6current2026-04Apache-2.0
Qwen3.52026-02-16Apache-2.0
Qwen3 (2507 update)2025-07Apache-2.0
Qwen32025-04-28Apache-2.0
Qwen2.52024-09Apache-2.0
Qwen22024-06Apache-2.0

FAQ

Is Qwen3 open source and free to use?

Yes. All Qwen3 open-weight models are released under the permissive Apache 2.0 license, and the weights are freely downloadable from Hugging Face, ModelScope, Kaggle, and Ollama. You can self-host them for commercial use, or call a hosted endpoint such as Alibaba Cloud Model Studio or OpenRouter.

What are Qwen3's thinking and non-thinking modes?

Qwen3 runs a single set of weights in two modes. Thinking mode produces a step-by-step reasoning trace before its final answer, which helps on hard math, coding, and logic tasks. Non-thinking mode answers directly for faster, cheaper replies. You can switch per request with a flag or with /think and /no_think controls, so one model covers both deliberate reasoning and quick chat.

What sizes does Qwen3 come in?

Qwen3 ships dense models at 0.6B, 1.7B, 4B, 8B, 14B, and 32B parameters, plus two Mixture-of-Experts models: Qwen3-30B-A3B (30B total, 3B active) and the flagship Qwen3-235B-A22B (235B total, 22B active). The small dense models run on a laptop or single GPU, while the MoE flagship targets frontier-level reasoning at lower inference cost than its total size suggests.

How much context can Qwen3 handle?

Qwen3 models support 32,768 tokens of context natively, and the larger models extend to 131,072 tokens (128K) using YaRN scaling. Recommended generation length is up to 32,768 tokens, or 38,912 for complex reasoning problems.