AI/TLDR

Qwen3-Next-80B-A3B

Alibaba's ultra-sparse 80B MoE that runs like a 3B model

Overview

Qwen3-Next-80B-A3B is Alibaba's Qwen team's first model built on the Qwen3-Next architecture, released September 12, 2025. It is a large language model with 80 billion total parameters but an ultra-sparse Mixture-of-Experts (MoE) design that activates only about 3 billion parameters (3.7%) per token. The goal is frontier-level quality at a fraction of the compute: Alibaba reports the base model was trained for under 10% of the GPU-hours of the dense Qwen3-32B while matching or beating it, and delivers more than 10x the inference throughput once context exceeds 32K tokens.

The architecture is what makes Qwen3-Next distinct. Instead of standard attention everywhere, it interleaves Gated DeltaNet (a fast linear-attention mechanism) with Gated Attention in a hybrid layout, then routes through an extremely sparse MoE of 512 experts (10 activated plus 1 shared) across 48 layers. Multi-Token Prediction (MTP) is used to speed up generation. Native context is 262,144 tokens and can be extended toward ~1 million tokens with YaRN scaling, making it well suited to long-document and long-context workloads.

Qwen3-Next-80B-A3B ships in two post-trained variants under the permissive Apache 2.0 license: Qwen3-Next-80B-A3B-Instruct (fast, non-thinking responses) and Qwen3-Next-80B-A3B-Thinking (chain-of-thought reasoning). Alibaba positions the Instruct variant as performing comparably to its much larger flagship Qwen3-235B-A22B-Instruct-2507, and the Thinking variant as a strong open reasoning model. It is text-only. Weights are on Hugging Face and the model is served via Alibaba Cloud Model Studio (DashScope) and third-party hosts like OpenRouter and Together AI.

Released2025-09-12
LicenseApache 2.0
WeightsOpen weights
Parameters80B total / 3B active
Context262K (up to ~1M)
Max output16K (Instruct), 32K (Thinking)
ArchitectureHybrid Mixture-of-Experts: Gated DeltaNet (linear attention) interleaved with Gated Attention, 512 experts (10 active + 1 shared) over 48 layers, with Multi-Token Prediction (MTP).
Knowledge cutoff2025
ModalitiesText
Statusavailable

Benchmarks

  1. MMLU-Pro (Instruct)80.6%
  2. GPQA (Instruct)72.9%
  3. AIME25 (Instruct)69.5%
  4. LiveCodeBench v6 (Instruct)56.6%
  5. Arena-Hard v2 (Instruct)82.7%
  6. AIME25 (Thinking)87.8%
  7. LiveCodeBench (Thinking)68.7%
  8. GPQA (Thinking)77.2%
  9. MMLU-Pro (Thinking)82.7%
  10. Artificial Analysis Intelligence Index (Instruct)14%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.09 / 1M tokens per 1M tokens
Output$1.10 / 1M tokens per 1M tokens

Qwen3-Next-80B-A3B-Instruct pricing on OpenRouter; rates vary by provider. Open weights can also be self-hosted at no per-token cost.

Pricing source ↗

Strengths

  • Extreme efficiency: 80B total parameters but only ~3B active per token, giving large-model quality at small-model inference cost
  • Long context: 262K tokens natively, extensible toward ~1M with YaRN, with 10x+ throughput vs Qwen3-32B beyond 32K
  • Permissive Apache 2.0 license with fully open weights on Hugging Face
  • Two variants for different needs: Instruct for fast answers, Thinking for step-by-step reasoning
  • Strong reasoning and coding scores that rival much larger models in its own family

Best for

  • Long-document analysis, summarization and RAG over large corpora that benefit from 256K+ context
  • Cost-sensitive production deployments needing large-model quality at low per-token compute
  • Math and coding tasks via the Thinking variant's chain-of-thought reasoning
  • Self-hosting on commodity GPU setups (incl. FP8 builds) where Apache 2.0 weights are required
  • Multilingual reasoning, knowledge QA and text generation in latency-sensitive apps

How to access

ProviderModel ID
Hugging Face ↗Qwen/Qwen3-Next-80B-A3B-Instruct
Alibaba Cloud Model Studio (DashScope) ↗qwen3-next-80b-a3b-instruct
OpenRouter ↗qwen/qwen3-next-80b-a3b-instruct
Together AI ↗Qwen/Qwen3-Next-80B-A3B-Instruct

FAQ

How many parameters does Qwen3-Next-80B-A3B actually use?

It has 80 billion total parameters but activates only about 3 billion (roughly 3.7%) per token, thanks to an ultra-sparse Mixture-of-Experts design that selects 10 of 512 experts plus 1 shared expert. That is why the name carries 'A3B' (active 3B) — you get large-model quality at small-model inference cost.

What is the context window of Qwen3-Next-80B-A3B?

It natively supports 262,144 tokens and can be extended toward roughly 1 million tokens using YaRN scaling. Alibaba reports more than 10x higher throughput than the dense Qwen3-32B once context length exceeds 32K tokens.

What is the difference between the Instruct and Thinking variants?

Qwen3-Next-80B-A3B-Instruct gives fast, direct answers without visible reasoning traces (non-thinking mode), with a recommended 16K max output. Qwen3-Next-80B-A3B-Thinking produces chain-of-thought reasoning for harder math and coding problems, with a longer recommended output (32K, up to ~82K for complex reasoning).

Is Qwen3-Next-80B-A3B open source and free to use?

Yes. Both the Instruct and Thinking variants are released under the permissive Apache 2.0 license with open weights on Hugging Face, so you can self-host them at no per-token cost. Hosted API access is also available through Alibaba Cloud Model Studio, OpenRouter and Together AI at provider-set rates.