What Is a Mixture-of-Experts (MoE) Model?

See how MoE models get huge capacity at a fraction of the compute by routing each token to a handful of specialist experts.

Advanced · 10 min read · Updated 2026-06-12

In plain English

A Mixture-of-Experts (MoE) model is a neural network that contains many small sub-networks called experts, but only switches a few of them on for each piece of text it reads. The model is enormous on paper, yet most of it sits idle on any given step. That is the whole trick: huge total capacity, small active cost.

Picture a hospital. It employs hundreds of specialists, but you don't see all of them when you walk in. A triage nurse glances at your problem and routes you to maybe two of them — a cardiologist and a radiologist, say. The hospital has the knowledge of every specialist on staff, but your visit only pays for the two who actually treat you. An MoE model works the same way. It holds the knowledge of a vast network, but every token (every chunk of text — see what a token is) only activates a tiny slice.

Contrast this with a normal dense model, where every single parameter fires for every token. A dense model is like a hospital that drags all 300 specialists into the room for every patient. Thorough, but absurdly expensive. MoE keeps the staff and ditches the crowd.

Why it matters

Scaling laws showed that bigger models with more parameters tend to be smarter. But for a dense model, bigger means more expensive on every single token — both to train and to run. Doubling the parameters roughly doubles the compute (and the GPU bill — see why LLMs need GPUs) for every word generated. That ceiling is brutal.

MoE breaks the link between how much a model knows and how much it costs to use. You can grow total parameters to add knowledge while keeping active parameters, and therefore per-token compute, roughly flat. A model with 400B total parameters needs about the per-token compute of a 17B dense model if only 17B are active per token, although it still needs the memory of a 400B model.
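To put rough numbers on that split, here is a back-of-envelope sketch. It rests on two assumptions that are not from this article: about 2 FLOPs per active parameter per token for a forward pass (a common rule of thumb), and 2 bytes per parameter (16-bit weights). The helper names are made up for illustration.

```python
# Back-of-envelope costs for dense vs. MoE models.
# Assumptions (illustrative, not measured): ~2 FLOPs per active parameter
# per token for a forward pass, and 2 bytes per parameter (16-bit weights).
FLOPS_PER_PARAM = 2
BYTES_PER_PARAM = 2


def weight_memory_gb(total_params: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM / 1e9


def forward_gflops_per_token(active_params: float) -> float:
    """Approximate forward-pass compute per token, in GFLOPs."""
    return active_params * FLOPS_PER_PARAM / 1e9


# name -> (total parameters, active parameters per token)
models = {
    "dense 400B": (400e9, 400e9),
    "MoE 400B / 17B active": (400e9, 17e9),
    "dense 17B": (17e9, 17e9),
}

for name, (total, active) in models.items():
    print(f"{name:24s} memory ~{weight_memory_gb(total):5.0f} GB   "
          f"compute ~{forward_gflops_per_token(active):5.0f} GFLOPs/token")
```

Under these assumptions the MoE row shares its memory figure with the 400B dense model and its compute figure with the 17B dense model, which is the whole trade in two numbers.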

Who cares about this

Frontier labs — it lets them ship models with trillions of total parameters without trillion-parameter inference bills.
Anyone reading a model card — when you see "235B-A22B" or "109B / 17B active", that's MoE notation. It tells you the model punches above its compute weight.
People running models locally — MoE shifts the bottleneck from compute to memory. You still need RAM/VRAM to hold all the experts, even though only a few run at once. This shapes quantization and hardware choices.
Cost-conscious builders — cheaper active compute is a big reason API prices keep falling for a given quality tier.

How it works

To see where the experts live, you need a quick picture of a transformer layer. Each layer has two main parts: an attention block (which lets tokens look at each other — see how attention works) and a feed-forward network (FFN), which does most of the heavy per-token computation.

MoE replaces that single FFN with many FFNs — the experts — plus a tiny router that decides which experts each token should visit. Attention usually stays dense and shared; only the FFN goes sparse.

// One MoE layer, for one token

Token arrives: hidden vector from attention
Router scores experts: small linear layer + softmax
Pick top-k: e.g. top-2 of 8 experts
Run only those experts: the rest stay off
Weighted sum: blend outputs by router weight

Walk through it. A token's hidden vector reaches the MoE layer. The router — usually a single small linear layer followed by a softmax — produces a score for every expert. The model keeps only the top-k highest-scoring experts (commonly top-2, sometimes top-4 or top-8), runs just those, and combines their outputs weighted by the router's scores. Every other expert is skipped entirely, so its parameters cost nothing on this token.
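The walk-through above can be traced for a single token in plain Python. This is a minimal sketch, not how any particular model implements it: the toy experts (each just scales its input), the logits, and the function names are all made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(x, router_logits, experts, top_k=2):
    """Return (output vector, chosen expert ids) for one token vector x."""
    probs = softmax(router_logits)                    # score every expert
    chosen = sorted(range(len(probs)),
                    key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)              # renormalize over the chosen few
    out = [0.0] * len(x)
    for i in chosen:                                  # only the top-k experts run
        y = experts[i](x)
        w = probs[i] / norm
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen


# Toy "experts": expert i multiplies its input by (i + 1).
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(4)]

out, chosen = route_token([1.0, 2.0], [0.1, 2.0, -1.0, 1.0], experts, top_k=2)
print("experts used:", chosen)
print("blended output:", out)
```

Experts 0 and 2 are never called for this token, so their parameters cost nothing here.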

The load-balancing problem

Left alone, the router cheats. Early in training it discovers a few experts are slightly better and sends everything to them. Those experts get all the practice and improve; the rest starve and stay useless. This is expert collapse — most experts go dead while a handful do all the work.

Labs fix this with load balancing. The classic approach adds an auxiliary loss that penalizes the router for piling tokens onto a few experts, nudging it toward even usage. A newer approach, popularized by DeepSeek, is auxiliary-loss-free balancing: instead of a loss term that fights the main training objective, it adds a per-expert bias to the router scores used for top-k selection and nudges each bias up or down, outside of gradient descent, based on how busy that expert has been recently. Both aim for the same thing: every expert pulling its weight.
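Both balancing ideas fit in a few lines. The sketch below assumes top-1 routing for simplicity and uses a Switch-Transformer-style form for the auxiliary loss (number of experts times the sum of token fraction times mean router probability). The bias update is a simplified sign-based rule in the spirit of the bias approach. The step size and function names are hypothetical.

```python
from collections import Counter


def aux_balance_loss(assignments, router_probs, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    assignments:  expert id chosen for each token (top-1 for simplicity)
    router_probs: per-token list of router probabilities over all experts
    """
    n_tokens = len(assignments)
    counts = Counter(assignments)
    loss = 0.0
    for e in range(num_experts):
        f = counts[e] / n_tokens                                 # share of tokens sent to e
        p = sum(probs[e] for probs in router_probs) / n_tokens   # mean router prob for e
        loss += f * p
    return num_experts * loss


def update_biases(biases, assignments, step=0.01):
    """Bias-based balancing: lower the bias of busy experts, raise idle ones.

    The bias is meant to be added to router scores for selection only.
    """
    counts = Counter(assignments)
    mean_load = len(assignments) / len(biases)
    return [b - step if counts[e] > mean_load else
            b + step if counts[e] < mean_load else b
            for e, b in enumerate(biases)]


balanced = aux_balance_loss([0, 1, 0, 1], [[0.5, 0.5]] * 4, num_experts=2)
collapsed = aux_balance_loss([0, 0, 0, 0], [[0.9, 0.1]] * 4, num_experts=2)
new_biases = update_biases([0.0, 0.0], [0, 0, 0, 0])
print(balanced, collapsed, new_biases)
```

The loss is lowest when usage is even and grows as tokens pile onto one expert. The bias rule gets a similar push without adding a term to the training loss.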

Shared experts

Many current designs (DeepSeek-V3, Llama 4) add one or two shared experts that run for every token, alongside the routed top-k. The shared expert soaks up general, always-useful computation, freeing the routed experts to specialize. It's a hybrid: a small dense core plus a big sparse pool.
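Shared experts change the parameter accounting in a simple way: they count toward both total and active parameters, while routed experts count toward active only if selected. Here is a rough sketch for one MoE layer. The layer sizes are hypothetical and do not describe any model named in this article; biases and attention weights are ignored.

```python
def ffn_params(dim: int, hidden: int) -> int:
    """Weights in a two-matrix FFN (dim -> hidden -> dim), biases ignored."""
    return 2 * dim * hidden


def moe_layer_params(dim, expert_hidden, n_routed, n_shared, top_k):
    """Return (total, active-per-token) weight counts for one MoE layer."""
    per_expert = ffn_params(dim, expert_hidden)
    router = dim * n_routed                        # one score per routed expert
    total = (n_routed + n_shared) * per_expert + router
    active = (top_k + n_shared) * per_expert + router
    return total, active


# Hypothetical configuration, for illustration only.
total, active = moe_layer_params(dim=4096, expert_hidden=2048,
                                 n_routed=64, n_shared=1, top_k=6)
print(f"total:  {total / 1e6:8.1f}M weights")
print(f"active: {active / 1e6:8.1f}M weights ({active / total:.1%} of the layer)")
```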

MoE vs dense, side by side

The cleanest way to feel the difference is to compare the same FFN slot in both styles.

// Dense FFN vs MoE FFN

Dense model

One big FFN per layer
Every parameter fires per token
Compute scales with total size
Simpler to train and serve
Quality ceiling tied to compute budget

MoE model

Many expert FFNs + a router
Only top-k experts fire per token
Compute scales with active size
Needs load balancing to train well
Higher quality per FLOP at the same speed

The headline result, repeated across labs: for a fixed training-compute budget, a sparse MoE model typically reaches higher quality than a dense model trained with the same budget. You spend your FLOPs on more total knowledge instead of forcing every token through every parameter. The catch is engineering complexity: routing, balancing, and the memory to hold all those experts.

Property	Dense	MoE
Parameters that fire per token	All of them	Only the active slice
Inference cost	Scales with total size	Scales with active size
Memory to hold the model	= total size	= total size (no savings)
Training difficulty	Lower	Higher (routing + balancing)
Quality per FLOP	Baseline	Higher

The current landscape (mid-2026)

MoE went from a research curiosity to the default for large open and frontier models. The pattern you'll see on model cards is total / active — for example Qwen3-235B-A22B means 235B total parameters, 22B active per token. Here are verified examples as of mid-2026.

Model	Total params	Active per token	Experts / routing
Mixtral 8x7B (Mistral)	~47B	~13B	8 experts, top-2
DeepSeek-V3	~671B	~37B	256 routed + shared, top-8
Llama 4 Scout	~109B	~17B	16 experts + shared
Llama 4 Maverick	~400B	~17B	128 experts + shared
Qwen3-235B-A22B	~235B	~22B	128 experts
gpt-oss-120b (OpenAI)	~117B	~5.1B	128 experts, top-4

A few things stand out. Mixtral 8x7B (released in December 2023) was the open model that made MoE mainstream: 8 experts, top-2, ~13B active, matching much larger dense models. DeepSeek-V3 pushed the count to 256 fine-grained routed experts and popularized auxiliary-loss-free balancing. Llama 4 holds active parameters at 17B while scaling total parameters from Scout (~109B) to Maverick (~400B); the announced Behemoth is far larger on both counts. And gpt-oss showed a frontier lab releasing an open-weight MoE with a strikingly low active count (~5B). These specifics move fast, but the shape of the trend has been stable: more total experts, modest active compute.
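To make the total / active notation concrete, here is a small sketch that parses a tag like `235B-A22B` and prints the active fraction for each row of the table above. The figures are the table's own approximations, and the helper name `parse_moe_tag` is made up for illustration.

```python
import re


def parse_moe_tag(tag: str):
    """Parse a '<total>B-A<active>B' tag, e.g. '235B-A22B', into billions."""
    m = re.search(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", tag)
    if m is None:
        raise ValueError(f"no total/active tag in {tag!r}")
    return float(m.group(1)), float(m.group(2))


# Approximate (total, active) parameters in billions, copied from the table.
table = {
    "Mixtral 8x7B": (47, 13),
    "DeepSeek-V3": (671, 37),
    "Llama 4 Scout": (109, 17),
    "Llama 4 Maverick": (400, 17),
    "Qwen3-235B-A22B": parse_moe_tag("Qwen3-235B-A22B"),
    "gpt-oss-120b": (117, 5.1),
}

for name, (total, active) in table.items():
    print(f"{name:18s} {active / total:6.1%} of parameters active per token")
```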

See the routing in code

Here's a stripped-down MoE layer in PyTorch. It's not production code, but it shows the exact mechanic: score experts, keep the top-k, run only those, blend by router weight.

moe_layer.py (Python)

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        # each expert is just a small feed-forward network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize the chosen few

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot+1] * expert(x[mask])
        return out                              # most experts never ran

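The key line is scores.topk(self.top_k, dim=-1). Out of every expert, only top_k of them ever see a given token, and mask.any() skips any expert that no token selected. Note that this sketch applies the softmax after selection, over the chosen scores only; applying a softmax over all experts and then renormalizing the top-k gives the same weights. In a real system this dispatch is heavily optimized, with tokens grouped per expert and run in big batched matrix multiplies, but the logic is exactly this: score, select, run the few, blend.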
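The "grouped per expert" idea can be sketched without any tensor library. The version below groups tokens by expert, calls each expert once on its batch, and scatters the weighted results back. It assumes scalar hidden states and toy experts to stay short; real kernels do this with sorted index tensors, and the names here are hypothetical.

```python
from collections import defaultdict


def dispatch(top_k_idx, top_k_weights):
    """Group (token, weight) pairs by expert id."""
    buckets = defaultdict(list)
    for token, (expert_ids, weights) in enumerate(zip(top_k_idx, top_k_weights)):
        for e, w in zip(expert_ids, weights):
            buckets[e].append((token, w))
    return dict(buckets)


def run_grouped(x, buckets, experts):
    """Run each expert once over its own batch, then blend per token."""
    out = [0.0] * len(x)
    for e, items in buckets.items():
        batch = [x[t] for t, _ in items]           # gather this expert's tokens
        results = experts[e](batch)                # one batched call per expert
        for (t, w), y in zip(items, results):      # scatter back, weighted
            out[t] += w * y
    return out


# Toy experts: expert e multiplies every value in its batch by (e + 1).
experts = [lambda batch, s=s: [s * v for v in batch] for s in (1, 2, 3)]

x = [1.0, 2.0, 3.0]                                # three tokens
idx = [[0, 1], [1, 2], [0, 2]]                     # top-2 experts per token
weights = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]     # router weights per token

buckets = dispatch(idx, weights)
out = run_grouped(x, buckets, experts)
print(buckets)
print(out)
```

The result matches what the per-token loop would produce. Only the order of the work changes, which is what lets real systems use large matrix multiplies.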

Going deeper

Fine-grained experts and the granularity dial

Early MoE used a few fat experts (Mixtral: 8). Newer designs use many thin experts (DeepSeek-V3: 256 routed) and pick more of them per token. This fine-grained approach gives the router more combinations to express — picking 8 of 256 yields far more distinct routings than 2 of 8 — which empirically improves quality at the same active parameter count. Granularity is now a deliberate design knob, not an accident.
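The gap in routing combinations is easy to check with the standard library. This counts unordered expert subsets only, so treat it as a rough measure of routing flexibility rather than a predictor of quality.

```python
import math

coarse = math.comb(8, 2)      # ways to pick 2 of 8 experts
fine = math.comb(256, 8)      # ways to pick 8 of 256 experts

print(f"2 of 8:   {coarse:,} possible expert sets")
print(f"8 of 256: {fine:,} possible expert sets")
print(f"ratio:    {fine // coarse:,}x")
```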

Token-choice vs expert-choice routing

The default is token-choice: each token picks its top-k experts. The risk is imbalance: popular experts overflow their capacity and excess tokens get dropped (skipped) or spilled. Expert-choice flips it: each expert picks its top-k tokens, which guarantees perfectly even load but means some tokens may get more experts than others, and some may get none. Production systems also set an expert capacity limit and use techniques like loss-free bias updates to keep utilization smooth without dropping tokens.
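Here is a toy comparison of the two schemes on the same router scores. It is a sketch under simplifying assumptions: top-1 token-choice, a hard capacity limit, and made-up unnormalized scores. The function names are hypothetical.

```python
def token_choice_with_capacity(scores, capacity):
    """Top-1 token-choice: each token picks its best expert; overflow is dropped."""
    n_experts = len(scores[0])
    load = [0] * n_experts
    assigned, dropped = {}, []
    for t, row in enumerate(scores):
        e = max(range(n_experts), key=lambda i: row[i])
        if load[e] < capacity:
            load[e] += 1
            assigned[t] = e
        else:
            dropped.append(t)          # expert is full, token skips the FFN
    return assigned, dropped


def expert_choice(scores, capacity):
    """Expert-choice: each expert takes its `capacity` highest-scoring tokens."""
    n_tokens, n_experts = len(scores), len(scores[0])
    return {e: sorted(range(n_tokens),
                      key=lambda t: scores[t][e], reverse=True)[:capacity]
            for e in range(n_experts)}


# Rows are tokens, columns are experts (hypothetical, unnormalized scores).
scores = [
    [0.9, 0.8],
    [0.8, 0.1],
    [0.3, 0.7],
    [0.2, 0.1],
]

assigned, dropped = token_choice_with_capacity(scores, capacity=2)
picks = expert_choice(scores, capacity=2)
print("token-choice:", assigned, "dropped:", dropped)
print("expert-choice:", picks)
```

With these scores, token-choice fills expert 0 and drops token 3. Expert-choice keeps both experts at exactly two tokens, but token 0 is processed by both experts while token 3 is picked by neither.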

The real bottleneck is communication, not math

On a single GPU, MoE is straightforward. At scale, experts are spread across many GPUs (expert parallelism), so routing a token to its experts means shipping its activations across the network and gathering the results back — an all-to-all communication step. For big MoE models this network shuffle, not the matrix math, often dominates inference latency. It's why MoE serving stacks obsess over interconnect bandwidth and clever expert placement, and why a model with low active FLOPs can still be tricky to serve fast.
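A rough upper bound shows why the shuffle matters. The sketch below assumes the worst case, where every selected expert sits on a different device from the token, so each copy of the hidden vector crosses the network twice per MoE layer (dispatch and combine). The configuration and function name are hypothetical, not taken from any real model.

```python
def all_to_all_bytes_per_token(hidden_dim, top_k, num_moe_layers, bytes_per_value=2):
    """Worst-case activation traffic for one token across all MoE layers.

    Assumes no selected expert is local to the token's device, and 16-bit
    activations by default.
    """
    per_layer = 2 * top_k * hidden_dim * bytes_per_value   # dispatch + combine
    return per_layer * num_moe_layers


# Hypothetical configuration, for illustration only.
traffic = all_to_all_bytes_per_token(hidden_dim=4096, top_k=8, num_moe_layers=32)
print(f"~{traffic / 2**20:.1f} MiB of network traffic per token (upper bound)")
```

Multiply that by thousands of tokens per batch and interconnect bandwidth becomes the limiting factor. Keeping a token's experts on nearby devices shrinks the figure.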

What experts do (and don't) specialize in

It's tempting to imagine a "Python expert" and a "French expert." Reality is messier. Analysis of trained MoE models shows experts specialize along subtle, often uninterpretable lines — sometimes by token type or syntax, rarely by clean human-readable topic. The router isn't trained to be interpretable; it's trained to lower loss. So treat "expert" as a useful name for a learned subnetwork the router favors for certain inputs, not a labeled domain specialist.

FAQ

What is the difference between total and active parameters in an MoE model?

Total parameters is everything the model contains across all experts — it sets raw knowledge and capacity, and it determines how much memory you need to load the model. Active parameters is the subset that actually runs for each token (the router's chosen top-k experts plus any shared experts), and it sets the compute cost and speed. An MoE labeled 235B-A22B has 235B total but only ~22B active per token, so it runs roughly as fast as a 22B dense model while knowing far more.

Is a mixture-of-experts model actually smaller or cheaper?

It's cheaper to run, not to store. All experts must sit in memory, so an MoE needs as much RAM/VRAM as a dense model of the same total size. What you save is compute: only the active slice fires per token, so inference is much faster and cheaper than a dense model with the same total parameter count. MoE trades plentiful memory for scarce compute.

How does the router decide which experts to use?

The router is a tiny linear layer that produces a score for every expert from the token's hidden vector, normalized with a softmax. The model keeps the top-k highest-scoring experts (often top-2, sometimes top-4 or top-8), runs only those, and blends their outputs weighted by the router scores. The router is trained jointly with the rest of the model, with a load-balancing mechanism to stop it from overusing a few favorite experts.

Which LLMs use mixture-of-experts as of 2026?

Many of the largest open and frontier models do. Verified examples include Mixtral 8x7B (the model that made open MoE mainstream), DeepSeek-V3 (~671B total / ~37B active, 256 routed experts), Meta's Llama 4 herd (Scout, Maverick, Behemoth), Qwen3-235B-A22B, and OpenAI's open-weight gpt-oss models. These specifics change fast, but MoE is now the default architecture for very large models.

What is expert collapse and how do labs prevent it?

Expert collapse is when the router learns to send almost all tokens to a few experts, so those few improve while the rest never train and go dead — wasting the model's capacity. Labs prevent it with load balancing: either an auxiliary loss that penalizes uneven expert usage, or an auxiliary-loss-free scheme (popularized by DeepSeek) that adjusts a per-expert bias based on recent load. Both push the router toward using every expert.
