Moonlight-16B-A3B

Moonshot AI's 16B/3B MoE model that proved the Muon optimizer scales

Overview

Moonlight-16B-A3B is an open-weight Mixture-of-Experts language model from Moonshot AI (the lab behind Kimi), released in February 2025 alongside the research paper "Muon is Scalable for LLM Training" (arXiv:2502.16982). Its main purpose was to prove a point: that the Muon optimizer, rather than the near-universal AdamW, can train a real large language model efficiently at scale. The model has 15.29B total parameters but activates only 2.24B per token, which is why it is marketed as "16B/3B" and carries the "A3B" (activated 3B) tag in its name.

Architecturally, Moonlight-16B-A3B reuses the DeepSeek-V3 design, so it works out of the box with popular inference engines like vLLM and SGLang. It was trained from scratch on 5.7 trillion tokens using Muon, an optimizer based on matrix orthogonalization. Moonshot's contribution was making Muon scale: adding weight decay and keeping the per-parameter update RMS consistent. The team reports roughly 2x the sample efficiency of Adam, reaching comparable quality at about 52% of the training compute. Moonshot ships two variants on Hugging Face: the base Moonlight-16B-A3B and the chat-tuned Moonlight-16B-A3B-Instruct, both under the MIT license.

Despite activating only 2.24B parameters, Moonlight beats similarly-sized dense and MoE baselines (Llama3.2-3B, Qwen2.5-3B, DeepSeek-v2-Lite) across English, code, math, and Chinese benchmarks, advancing the performance-per-FLOP Pareto frontier. It is a text-only model with an 8K context window. Today it is overshadowed in Moonshot's own catalog by the far larger Kimi K2 models, but it remains an important reference point for anyone studying alternative optimizers and efficient MoE training.

License	MIT
Weights	Open weights
Parameters	15.29B total / 2.24B activated (marketed as 16B/3B, hence the "A3B" name)
Context	8K tokens (8,192)
Max output	Not separately published; bounded by the 8K context window
Architecture	Sparse Mixture-of-Experts using the same architecture as DeepSeek-V3 (so it runs on vLLM and SGLang). 15.29B total parameters with only 2.24B activated per token. Trained from scratch on 5.7T tokens with the Muon optimizer rather than AdamW; Moonshot's scaling fixes were weight decay plus consistent per-parameter RMS update scaling, which the team reports as roughly 2x more sample-efficient than Adam (about 52% of the training FLOPs for comparable quality).
Knowledge cutoff	Not officially published
Modalities	text
Status	Released February 2025; open weights remain available on Hugging Face. Superseded in Moonshot's lineup by the much larger Kimi K2 models, but not formally deprecated.

Benchmarks

MMLU70%
MMLU-pro42.4%
BBH65.2%
TriviaQA66.3%
HumanEval48.1%
MBPP63.8%
GSM8K77.4%
MATH45.3%
C-Eval77.2%
CMMLU78.2%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

Strong quality for its tiny 2.24B active-parameter footprint, beating Llama3.2-3B and DeepSeek-v2-Lite across most benchmarks
Trained with the Muon optimizer, demonstrating roughly 2x the sample efficiency of Adam and about 52% of the training FLOPs for comparable results
Open weights under the permissive MIT license, for both base and instruct variants
Drop-in compatible with vLLM and SGLang thanks to the shared DeepSeek-V3 architecture
Cheap to serve: only 2.24B parameters are active per token despite 15.29B total

Best for

Research into alternative optimizers (Muon vs AdamW) and efficient MoE pretraining
Lightweight self-hosted chat and instruction-following where a small active footprint matters
On-prem code generation and math reasoning at low inference cost
English and Chinese language tasks (it was tuned on both)
A reference baseline for benchmarking new small MoE models

How to access

Provider	Model ID
Hugging Face (self-host / weights) ↗	`moonshotai/Moonlight-16B-A3B-Instruct`
OpenRouter (free endpoint) ↗	`moonshotai/moonlight-16b-a3b-instruct`

FAQ

What is Moonlight-16B-A3B?

It is an open-weight Mixture-of-Experts language model from Moonshot AI, released in February 2025. It has 15.29B total parameters but activates only 2.24B per token (hence the "16B/3B" / "A3B" naming), and its main claim to fame is being trained with the Muon optimizer instead of AdamW.

What makes Moonlight special compared to other small models?

It was used to prove that the Muon optimizer scales to real LLM training. Moonshot reports roughly 2x the sample efficiency of Adam, reaching comparable quality at about 52% of the training FLOPs. With only 2.24B active parameters it still beats Llama3.2-3B and DeepSeek-v2-Lite on most English, code, math, and Chinese benchmarks.

Is Moonlight-16B-A3B free and open source?

Yes. Both the base and Instruct variants are released under the MIT license on Hugging Face, so you can download, fine-tune, and self-host them freely. Because it shares the DeepSeek-V3 architecture, it runs on vLLM and SGLang.

What is its context window?

Moonlight-16B-A3B supports an 8K-token (8,192) context window and is a text-only model.

// Overview

// Benchmarks

// Strengths

// Best for

// How to access

// FAQ

Overview

Benchmarks

Strengths

Best for

How to access

FAQ