Qwen3-Next-80B-A3B

Alibaba's ultra-sparse 80B MoE that runs like a 3B model

Overview

Qwen3-Next-80B-A3B is Alibaba's Qwen team's first model built on the Qwen3-Next architecture, released September 12, 2025. It is a large language model with 80 billion total parameters but an ultra-sparse Mixture-of-Experts (MoE) design that activates only about 3 billion parameters (3.7%) per token. The goal is frontier-level quality at a fraction of the compute: Alibaba reports the base model was trained for under 10% of the GPU-hours of the dense Qwen3-32B while matching or beating it, and delivers more than 10x the inference throughput once context exceeds 32K tokens.

The architecture is what makes Qwen3-Next distinct. Instead of standard attention everywhere, it interleaves Gated DeltaNet (a fast linear-attention mechanism) with Gated Attention in a hybrid layout, then routes through an extremely sparse MoE of 512 experts (10 activated plus 1 shared) across 48 layers. Multi-Token Prediction (MTP) is used to speed up generation. Native context is 262,144 tokens and can be extended toward ~1 million tokens with YaRN scaling, making it well suited to long-document and long-context workloads.

Qwen3-Next-80B-A3B ships in two post-trained variants under the permissive Apache 2.0 license: Qwen3-Next-80B-A3B-Instruct (fast, non-thinking responses) and Qwen3-Next-80B-A3B-Thinking (chain-of-thought reasoning). Alibaba positions the Instruct variant as performing comparably to its much larger flagship Qwen3-235B-A22B-Instruct-2507, and the Thinking variant as a strong open reasoning model. It is text-only. Weights are on Hugging Face and the model is served via Alibaba Cloud Model Studio (DashScope) and third-party hosts like OpenRouter and Together AI.

Released	2025-09-12
License	Apache 2.0
Weights	Open weights
Parameters	80B total / 3B active
Context	262K (up to ~1M)
Max output	16K (Instruct), 32K (Thinking)
Architecture	Hybrid Mixture-of-Experts: Gated DeltaNet (linear attention) interleaved with Gated Attention, 512 experts (10 active + 1 shared) over 48 layers, with Multi-Token Prediction (MTP).
Knowledge cutoff	2025
Modalities	Text
Status	available

Benchmarks

MMLU-Pro (Instruct)80.6%
GPQA (Instruct)72.9%
AIME25 (Instruct)69.5%
LiveCodeBench v6 (Instruct)56.6%
Arena-Hard v2 (Instruct)82.7%
AIME25 (Thinking)87.8%
LiveCodeBench (Thinking)68.7%
GPQA (Thinking)77.2%
MMLU-Pro (Thinking)82.7%
Artificial Analysis Intelligence Index (Instruct)14%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input	$0.09 / 1M tokens per 1M tokens
Output	$1.10 / 1M tokens per 1M tokens

Qwen3-Next-80B-A3B-Instruct pricing on OpenRouter; rates vary by provider. Open weights can also be self-hosted at no per-token cost.

Pricing source ↗

Strengths

Extreme efficiency: 80B total parameters but only ~3B active per token, giving large-model quality at small-model inference cost
Long context: 262K tokens natively, extensible toward ~1M with YaRN, with 10x+ throughput vs Qwen3-32B beyond 32K
Permissive Apache 2.0 license with fully open weights on Hugging Face
Two variants for different needs: Instruct for fast answers, Thinking for step-by-step reasoning
Strong reasoning and coding scores that rival much larger models in its own family

Best for

Long-document analysis, summarization and RAG over large corpora that benefit from 256K+ context
Cost-sensitive production deployments needing large-model quality at low per-token compute
Math and coding tasks via the Thinking variant's chain-of-thought reasoning
Self-hosting on commodity GPU setups (incl. FP8 builds) where Apache 2.0 weights are required
Multilingual reasoning, knowledge QA and text generation in latency-sensitive apps

How to access

Provider	Model ID
Hugging Face ↗	`Qwen/Qwen3-Next-80B-A3B-Instruct`
Alibaba Cloud Model Studio (DashScope) ↗	`qwen3-next-80b-a3b-instruct`
OpenRouter ↗	`qwen/qwen3-next-80b-a3b-instruct`
Together AI ↗	`Qwen/Qwen3-Next-80B-A3B-Instruct`

FAQ

How many parameters does Qwen3-Next-80B-A3B actually use?

It has 80 billion total parameters but activates only about 3 billion (roughly 3.7%) per token, thanks to an ultra-sparse Mixture-of-Experts design that selects 10 of 512 experts plus 1 shared expert. That is why the name carries 'A3B' (active 3B) — you get large-model quality at small-model inference cost.

What is the context window of Qwen3-Next-80B-A3B?

It natively supports 262,144 tokens and can be extended toward roughly 1 million tokens using YaRN scaling. Alibaba reports more than 10x higher throughput than the dense Qwen3-32B once context length exceeds 32K tokens.

What is the difference between the Instruct and Thinking variants?

Qwen3-Next-80B-A3B-Instruct gives fast, direct answers without visible reasoning traces (non-thinking mode), with a recommended 16K max output. Qwen3-Next-80B-A3B-Thinking produces chain-of-thought reasoning for harder math and coding problems, with a longer recommended output (32K, up to ~82K for complex reasoning).

Is Qwen3-Next-80B-A3B open source and free to use?

Yes. Both the Instruct and Thinking variants are released under the permissive Apache 2.0 license with open weights on Hugging Face, so you can self-host them at no per-token cost. Hosted API access is also available through Alibaba Cloud Model Studio, OpenRouter and Together AI at provider-set rates.

// Overview

// Benchmarks

// Pricing

// Strengths

// Best for

// How to access

// FAQ