AI/TLDR

What Is a Mixture-of-Experts (MoE) Model?

See how MoE models get huge capacity at a fraction of the compute by routing each token to a handful of specialist experts.

ADVANCED · 10 MIN READ · UPDATED 2026-06-12

In plain English

A Mixture-of-Experts (MoE) model is a neural network that contains many small sub-networks called experts, but only switches a few of them on for each piece of text it reads. The model is enormous on paper, yet most of it sits idle on any given step. That is the whole trick: huge total capacity, small active cost.

Picture a hospital. It employs hundreds of specialists, but you don't see all of them when you walk in. A triage nurse glances at your problem and routes you to maybe two of them — a cardiologist and a radiologist, say. The hospital has the knowledge of every specialist on staff, but your visit only pays for the two who actually treat you. An MoE model works the same way. It holds the knowledge of a vast network, but every token (every chunk of text — see what a token is) only activates a tiny slice.

Contrast this with a normal dense model, where every single parameter fires for every token. A dense model is like a hospital that drags all 300 specialists into the room for every patient. Thorough, but absurdly expensive. MoE keeps the staff and ditches the crowd.

Why it matters

Scaling laws showed that bigger models with more parameters tend to be smarter. But for a dense model, bigger means more expensive on every single token — both to train and to run. Doubling the parameters roughly doubles the compute (and the GPU bill — see why LLMs need GPUs) for every word generated. That ceiling is brutal.

MoE breaks the link between how much a model knows and how much it costs to use. You can grow total parameters to add knowledge while keeping active parameters, and therefore per-token compute, roughly flat. A model with 400B total parameters can run at roughly the per-token compute cost of a 17B dense model if only 17B are active per token, although it still needs the memory to hold all 400B.
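A rough back-of-envelope check of the 400B / 17B example. It leans on the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. That rule and the helper name `forward_flops_per_token` are assumptions for illustration; real cost also depends on attention, sequence length, and hardware.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params


# Dense model: all 400B parameters fire on every token.
dense_flops = forward_flops_per_token(400e9)
# MoE model with the same total size but only 17B active per token.
moe_flops = forward_flops_per_token(17e9)

active_fraction = 17e9 / 400e9
compute_ratio = dense_flops / moe_flops

print(f"active fraction: {active_fraction:.2%}")
print(f"dense / MoE compute per token: {compute_ratio:.1f}x")
```

Under this estimate the MoE model touches a few percent of its weights per token, which is where the compute saving comes from.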

Who cares about this

  • Frontier labs — it lets them ship models with trillions of total parameters without trillion-parameter inference bills.
  • Anyone reading a model card — when you see "235B-A22B" or "109B / 17B active", that's MoE notation. It tells you the model punches above its compute weight.
  • People running models locally — MoE shifts the bottleneck from compute to memory. You still need RAM/VRAM to hold all the experts, even though only a few run at once. This shapes quantization and hardware choices.
  • Cost-conscious builders — cheaper active compute is a big reason API prices keep falling for a given quality tier.
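To make the memory point concrete, here is a minimal sketch that estimates how much RAM/VRAM the weights alone need at different precisions. It uses the "109B / 17B active" figure from the list above. The helper name `weight_memory_gb` is hypothetical, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


# Memory depends on TOTAL parameters, not on how many are active per token.
for bits in (16, 8, 4):
    gb = weight_memory_gb(109e9, bits)
    print(f"109B total at {bits}-bit: {gb:.1f} GB")
```

The 17B active figure never appears in this calculation, which is why quantization matters so much for running MoE models locally.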

How it works

To see where the experts live, you need a quick picture of a transformer layer. Each layer has two main parts: an attention block (which lets tokens look at each other — see how attention works) and a feed-forward network (FFN), which does most of the heavy per-token computation.

MoE replaces that single FFN with many FFNs — the experts — plus a tiny router that decides which experts each token should visit. Attention usually stays dense and shared; only the FFN goes sparse.

Walk through it. A token's hidden vector reaches the MoE layer. The router — usually a single small linear layer followed by a softmax — produces a score for every expert. The model keeps only the top-k highest-scoring experts (commonly top-2, sometimes top-4 or top-8), runs just those, and combines their outputs weighted by the router's scores. Every other expert is skipped entirely, so its parameters cost nothing on this token.
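The walk-through above can be traced by hand for a single token. This is a minimal pure-Python sketch with made-up router logits and toy scalar "experts"; the names `route` and `softmax` and every number here are assumptions for illustration, not values from any real model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=2):
    """Return [(expert_index, weight), ...] for the top_k experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)      # renormalize over the chosen few
    return [(i, probs[i] / norm) for i in chosen]


# Hypothetical router logits for one token over 4 experts.
logits = [2.0, 0.5, 1.0, -1.0]
selection = route(logits, top_k=2)

# Toy experts: each one just scales its scalar input by a different factor.
experts = [lambda h, s=s: s * h for s in (1.0, 2.0, 3.0, 4.0)]
hidden = 10.0
output = sum(w * experts[i](hidden) for i, w in selection)

print(selection)
print(output)
```

Experts 1 and 3 are never called for this token. Renormalizing the chosen probabilities gives the same weights as applying a softmax to only the top-k logits, which is the order some implementations use.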

The load-balancing problem

Left alone, the router cheats. Early in training it discovers a few experts are slightly better and sends everything to them. Those experts get all the practice and improve; the rest starve and stay useless. This is expert collapse — most experts go dead while a handful do all the work.

Labs fix this with load balancing. The classic approach adds an auxiliary loss that penalizes the router for piling tokens onto a few experts, nudging it toward even usage. A newer approach, popularized by DeepSeek, is auxiliary-loss-free balancing: instead of a loss term that fights the main training objective, it adds a small per-expert bias to the router scores used for expert selection. The bias is not trained by gradient descent; after each step it is nudged down for experts that were overloaded and up for experts that were underused. Both aim for the same thing: every expert pulling its weight.
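Both balancing ideas fit in a few lines. The sketch below shows one common form of the auxiliary loss (the Switch-Transformer-style product of token share and mean router probability) and a sign-based bias update in the spirit of the loss-free scheme. The function names, the top-1 simplification, and the `alpha` and `step` values are assumptions for illustration, not any lab's actual settings.

```python
def aux_balance_loss(assignments, router_probs, num_experts, alpha=0.01):
    """Auxiliary loss of the form alpha * N * sum_i f_i * P_i.

    assignments : chosen expert index per token (top-1 for simplicity)
    router_probs: per-token list of router probabilities over experts
    """
    n_tokens = len(assignments)
    total = 0.0
    for e in range(num_experts):
        f_e = sum(1 for a in assignments if a == e) / n_tokens   # token share
        p_e = sum(p[e] for p in router_probs) / n_tokens         # mean probability
        total += f_e * p_e
    return alpha * num_experts * total


def update_biases(biases, tokens_per_expert, step=0.001):
    """Loss-free sketch: lower the bias of busy experts, raise it for idle ones."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    updated = []
    for bias, load in zip(biases, tokens_per_expert):
        if load > mean_load:
            updated.append(bias - step)
        elif load < mean_load:
            updated.append(bias + step)
        else:
            updated.append(bias)
    return updated


balanced = aux_balance_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)
collapsed = aux_balance_loss([0, 0, 0, 0], [[0.97, 0.01, 0.01, 0.01]] * 4, num_experts=4)
new_biases = update_biases([0.0, 0.0, 0.0, 0.0], tokens_per_expert=[90, 5, 3, 2])

print(balanced, collapsed, new_biases)
```

The loss is lowest when usage is uniform and grows as tokens pile onto one expert. The bias update needs no gradient at all, only a count of how many tokens each expert recently received.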

Shared experts

Many current designs (DeepSeek-V3, Llama 4) add one or two shared experts that run for every token, alongside the routed top-k. The shared expert soaks up general, always-useful computation, freeing the routed experts to specialize. It's a hybrid: a small dense core plus a big sparse pool.
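A minimal sketch of how a shared expert combines with the routed ones, assuming the two outputs are simply summed. Real models differ in how they scale and combine these terms, and the toy scalar experts and the name `moe_with_shared` are made up for illustration.

```python
def moe_with_shared(hidden, shared_experts, routed_experts, selection):
    """Sum the always-on shared experts with the router's weighted picks.

    selection: [(routed_expert_index, weight), ...] chosen by the router.
    """
    shared_out = sum(expert(hidden) for expert in shared_experts)
    routed_out = sum(w * routed_experts[i](hidden) for i, w in selection)
    return shared_out + routed_out


# Toy scalar experts: one shared, four routed.
shared = [lambda h: 0.5 * h]
routed = [lambda h, s=s: s * h for s in (1.0, 2.0, 3.0, 4.0)]

# The router picked routed experts 0 and 2 for this token (hypothetical weights).
y = moe_with_shared(10.0, shared, routed, selection=[(0, 0.75), (2, 0.25)])
print(y)
```

The shared expert runs regardless of what the router decides, so its parameters always count toward the active total.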

MoE vs dense, side by side

The cleanest way to feel the difference is to compare the same FFN slot in both styles.

The headline result, repeated across labs: for a fixed training-compute budget, a sparse MoE model typically reaches higher quality than a dense model trained on the same budget. You spend your FLOPs on more total knowledge instead of forcing every token through every parameter. The catch is engineering complexity: routing, balancing, and the memory to hold all those experts.

| Property | Dense | MoE |
|---|---|---|
| Parameters that fire per token | All of them | Only the active slice |
| Inference cost | Scales with total size | Scales with active size |
| Memory to hold the model | = total size | = total size (no savings) |
| Training difficulty | Lower | Higher (routing + balancing) |
| Quality per FLOP | Baseline | Higher |

The current landscape (mid-2026)

MoE went from a research curiosity to the default for large open and frontier models. The pattern you'll see on model cards is total / active — for example Qwen3-235B-A22B means 235B total parameters, 22B active per token. Here are verified examples as of mid-2026.

| Model | Total params | Active per token | Experts / routing |
|---|---|---|---|
| Mixtral 8x7B (Mistral) | ~47B | ~13B | 8 experts, top-2 |
| DeepSeek-V3 | ~671B | ~37B | 256 routed + shared, top-8 |
| Llama 4 Scout | ~109B | ~17B | 16 experts + shared |
| Llama 4 Maverick | ~400B | ~17B | 128 experts + shared |
| Qwen3-235B-A22B | ~235B | ~22B | 128 experts |
| gpt-oss-120b (OpenAI) | ~117B | ~5.1B | 128 experts, top-4 |

A few things stand out. Mixtral 8x7B (released December 2023) was the open model that made MoE mainstream: 8 experts, top-2, ~13B active, matching much larger dense models. DeepSeek-V3 pushed the count to 256 fine-grained routed experts and popularized auxiliary-loss-free balancing. Llama 4 keeps active parameters small (17B) while scaling total parameters up across the Scout/Maverick/Behemoth herd. And gpt-oss showed a frontier lab releasing open-weight MoE models with a strikingly low active count (~5B). These specifics move fast, but the shape of the trend has been stable: more total experts, modest active compute.
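If you want to pull the total / active split out of a model name programmatically, a small regular expression covers the "235B-A22B" style. This is a sketch for that one naming pattern only; it is not a standard, and the helper name `parse_total_active` is hypothetical.

```python
import re

# Matches names such as "Qwen3-235B-A22B": <total>B-A<active>B.
TOTAL_ACTIVE = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B")


def parse_total_active(name: str):
    """Return (total_billions, active_billions), or None if the tag is absent."""
    match = TOTAL_ACTIVE.search(name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


total, active = parse_total_active("Qwen3-235B-A22B")
print(f"{total:.0f}B total, {active:.0f}B active "
      f"({active / total:.1%} of parameters per token)")
```

Names that do not follow this convention, such as "Mixtral 8x7B", return None and need their figures looked up on the model card.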

See the routing in code

Here's a stripped-down MoE layer in PyTorch. It's not production code, but it shows the exact mechanic: score experts, keep the top-k, run only those, blend by router weight.

moe_layer.py (Python)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        # each expert is just a small feed-forward network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize the chosen few

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
        return out                              # each token went through only top_k experts

The key line is scores.topk(self.top_k). Out of every expert, only top_k of them ever see a given token. The mask.any() check skips an expert only when no token in the batch picked it for that slot, so across a large batch most experts still run, each on its own subset of tokens. In a real system this dispatch is heavily optimized (tokens are grouped per expert and run in big batched matrix multiplies), but the logic is exactly this: score, select, run the few, blend.
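The "group tokens per expert" step can be sketched without any tensor library. The snippet below turns per-token routing decisions into per-expert work lists, which is the shape a batched implementation needs before it runs each expert once on its whole group. The routing values and the name `group_by_expert` are hypothetical.

```python
from collections import defaultdict


def group_by_expert(topk_idx, topk_weights):
    """Build expert -> [(token_index, weight), ...] from per-token routing."""
    groups = defaultdict(list)
    for token, (experts, weights) in enumerate(zip(topk_idx, topk_weights)):
        for expert, weight in zip(experts, weights):
            groups[expert].append((token, weight))
    return dict(groups)


# Hypothetical routing for 4 tokens, top-2 out of 4 experts.
idx = [[0, 2], [2, 1], [0, 2], [3, 0]]
wts = [[0.7, 0.3], [0.6, 0.4], [0.5, 0.5], [0.9, 0.1]]

batches = group_by_expert(idx, wts)
for expert in sorted(batches):
    print(f"expert {expert}: {batches[expert]}")
```

Each expert then processes its group in a single call, and the weighted results are scattered back to the original token positions.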

Going deeper

Fine-grained experts and the granularity dial

Early MoE used a few fat experts (Mixtral: 8). Newer designs use many thin experts (DeepSeek-V3: 256 routed) and pick more of them per token. This fine-grained approach gives the router more combinations to express — picking 8 of 256 yields far more distinct routings than 2 of 8 — which empirically improves quality at the same active parameter count. Granularity is now a deliberate design knob, not an accident.
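The gap between the two routing spaces is easy to check with the standard library. This counts unordered expert sets only and ignores the router weights, so treat it as a rough illustration of the combinatorics rather than a measure of model quality.

```python
import math

coarse = math.comb(8, 2)      # choose 2 of 8 experts
fine = math.comb(256, 8)      # choose 8 of 256 experts

print(f"2 of 8:   {coarse} possible expert sets")
print(f"8 of 256: {fine:.3e} possible expert sets")
print(f"ratio:    {fine / coarse:.3e}")
```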

Token-choice vs expert-choice routing

The default is token-choice: each token picks its top-k experts. The risk is imbalance: popular experts overflow their capacity and excess tokens get dropped (skipped) or spilled. Expert-choice flips it: each expert picks its top-k tokens, which guarantees perfectly even load but means some tokens may get more experts than others, and some may get none. Production systems also set an expert capacity limit and use techniques like loss-free bias updates to keep utilization smooth without dropping tokens.
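The two routing directions can be compared on a tiny example. This is a simplified sketch: the scores are made up, tokens are processed in order, and overflowing tokens are simply recorded as dropped. The function names are hypothetical and real systems handle overflow in more elaborate ways.

```python
def token_choice_with_capacity(scores, top_k, capacity):
    """scores[t][e] is the router score of token t for expert e.

    Each token picks its top_k experts; a full expert rejects later tokens.
    Returns (assignments, dropped) as lists of (token, expert) pairs.
    """
    num_experts = len(scores[0])
    load = [0] * num_experts
    assignments, dropped = [], []
    for t, row in enumerate(scores):
        ranked = sorted(range(num_experts), key=lambda e: row[e], reverse=True)
        for e in ranked[:top_k]:
            if load[e] < capacity:
                load[e] += 1
                assignments.append((t, e))
            else:
                dropped.append((t, e))
    return assignments, dropped


def expert_choice(scores, tokens_per_expert):
    """Each expert takes its highest-scoring tokens, so load is even by design."""
    num_tokens, num_experts = len(scores), len(scores[0])
    assignments = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignments += [(t, e) for t in ranked[:tokens_per_expert]]
    return assignments


# Hypothetical scores: 4 tokens x 2 experts, and every token prefers expert 0.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]

tc_assigned, tc_dropped = token_choice_with_capacity(scores, top_k=1, capacity=2)
ec_assigned = expert_choice(scores, tokens_per_expert=2)

print("token-choice assigned:", tc_assigned, "dropped:", tc_dropped)
print("expert-choice assigned:", ec_assigned)
```

With token-choice, expert 0 fills up and the last two tokens are dropped while expert 1 sits idle. With expert-choice, both experts receive exactly two tokens.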

The real bottleneck is communication, not math

On a single GPU, MoE is straightforward. At scale, experts are spread across many GPUs (expert parallelism), so routing a token to its experts means shipping its activations across the network and gathering the results back — an all-to-all communication step. For big MoE models this network shuffle, not the matrix math, often dominates inference latency. It's why MoE serving stacks obsess over interconnect bandwidth and clever expert placement, and why a model with low active FLOPs can still be tricky to serve fast.
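A rough upper bound on the traffic involved, as a sketch. It assumes every selected expert lives on a different device from the token, that activations are sent once and results gathered once per MoE layer, and that values are 2 bytes each. The configuration numbers and the helper name are hypothetical and not taken from any specific model.

```python
def all_to_all_bytes_per_token(hidden_dim, top_k, bytes_per_value=2):
    """Upper bound for one MoE layer: dispatch to top_k experts, then gather back."""
    dispatch = top_k * hidden_dim * bytes_per_value   # send activations out
    combine = top_k * hidden_dim * bytes_per_value    # bring expert outputs back
    return dispatch + combine


# Hypothetical configuration: hidden size 4096, top-8 routing, 16-bit values.
per_token = all_to_all_bytes_per_token(hidden_dim=4096, top_k=8)

# 8192 tokens in flight at once, for a single MoE layer.
per_batch_mb = per_token * 8192 / 1e6

print(f"{per_token} bytes per token per MoE layer")
print(f"{per_batch_mb:.1f} MB per 8192-token batch per MoE layer")
```

Multiply by the number of MoE layers to see why interconnect bandwidth matters: the traffic grows with top-k and hidden size, not with the FLOPs each expert performs.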

What experts do (and don't) specialize in

It's tempting to imagine a "Python expert" and a "French expert." Reality is messier. Analysis of trained MoE models shows experts specialize along subtle, often uninterpretable lines — sometimes by token type or syntax, rarely by clean human-readable topic. The router isn't trained to be interpretable; it's trained to lower loss. So treat "expert" as a useful name for a learned subnetwork the router favors for certain inputs, not a labeled domain specialist.

FAQ

What is the difference between total and active parameters in an MoE model?

Total parameters is everything the model contains across all experts. It sets raw knowledge and capacity, and it determines how much memory you need to load the model. Active parameters is the subset that actually runs for each token (the router's chosen top-k experts plus any shared experts and the dense layers such as attention), and it sets the compute cost and speed. An MoE labeled 235B-A22B has 235B total but only ~22B active per token, so its per-token compute is roughly that of a 22B dense model while its capacity is far larger.

Is a mixture-of-experts model actually smaller or cheaper?

It's cheaper to run, not to store. All experts must sit in memory, so an MoE needs as much RAM/VRAM as a dense model of the same total size. What you save is compute: only the active slice fires per token, so inference is much faster and cheaper than a dense model with the same total parameter count. MoE trades plentiful memory for scarce compute.

How does the router decide which experts to use?

The router is a tiny linear layer that produces a score for every expert from the token's hidden vector, normalized with a softmax. The model keeps the top-k highest-scoring experts (often top-2, sometimes top-4 or top-8), runs only those, and blends their outputs weighted by the router scores. The router is trained jointly with the rest of the model, with a load-balancing mechanism to stop it from overusing a few favorite experts.

Which LLMs use mixture-of-experts as of 2026?

Many of the largest open and frontier models do. Verified examples include Mixtral 8x7B (the model that made open MoE mainstream), DeepSeek-V3 (~671B total / ~37B active, 256 routed experts), Meta's Llama 4 herd (Scout, Maverick, Behemoth), Qwen3-235B-A22B, and OpenAI's open-weight gpt-oss models. These specifics change fast, but MoE is now the default architecture for very large models.

What is expert collapse and how do labs prevent it?

Expert collapse is when the router learns to send almost all tokens to a few experts, so those few improve while the rest never train and go dead — wasting the model's capacity. Labs prevent it with load balancing: either an auxiliary loss that penalizes uneven expert usage, or an auxiliary-loss-free scheme (popularized by DeepSeek) that adjusts a per-expert bias based on recent load. Both push the router toward using every expert.

Further reading