AI/TLDR

Moonlight-16B-A3B

Moonshot AI's 16B/3B MoE model that proved the Muon optimizer scales

Overview

Moonlight-16B-A3B is an open-weight Mixture-of-Experts language model from Moonshot AI (the lab behind Kimi), released in February 2025 alongside the research paper "Muon is Scalable for LLM Training" (arXiv:2502.16982). Its main purpose was to prove a point: that the Muon optimizer, rather than the near-universal AdamW, can train a real large language model efficiently at scale. The model has 15.29B total parameters but activates only 2.24B per token, which is why it is marketed as "16B/3B" and carries the "A3B" (activated 3B) tag in its name.

Architecturally, Moonlight-16B-A3B reuses the DeepSeek-V3 design, so it works out of the box with popular inference engines like vLLM and SGLang. It was trained from scratch on 5.7 trillion tokens using Muon, an optimizer based on matrix orthogonalization. Moonshot's contribution was making Muon scale: adding weight decay and keeping the per-parameter update RMS consistent. The team reports roughly 2x the sample efficiency of Adam, reaching comparable quality at about 52% of the training compute. Moonshot ships two variants on Hugging Face: the base Moonlight-16B-A3B and the chat-tuned Moonlight-16B-A3B-Instruct, both under the MIT license.

Despite activating only 2.24B parameters, Moonlight beats similarly-sized dense and MoE baselines (Llama3.2-3B, Qwen2.5-3B, DeepSeek-v2-Lite) across English, code, math, and Chinese benchmarks, advancing the performance-per-FLOP Pareto frontier. It is a text-only model with an 8K context window. Today it is overshadowed in Moonshot's own catalog by the far larger Kimi K2 models, but it remains an important reference point for anyone studying alternative optimizers and efficient MoE training.

LicenseMIT
WeightsOpen weights
Parameters15.29B total / 2.24B activated (marketed as 16B/3B, hence the "A3B" name)
Context8K tokens (8,192)
Max outputNot separately published; bounded by the 8K context window
ArchitectureSparse Mixture-of-Experts using the same architecture as DeepSeek-V3 (so it runs on vLLM and SGLang). 15.29B total parameters with only 2.24B activated per token. Trained from scratch on 5.7T tokens with the Muon optimizer rather than AdamW; Moonshot's scaling fixes were weight decay plus consistent per-parameter RMS update scaling, which the team reports as roughly 2x more sample-efficient than Adam (about 52% of the training FLOPs for comparable quality).
Knowledge cutoffNot officially published
Modalitiestext
StatusReleased February 2025; open weights remain available on Hugging Face. Superseded in Moonshot's lineup by the much larger Kimi K2 models, but not formally deprecated.

Benchmarks

  1. MMLU70%
  2. MMLU-pro42.4%
  3. BBH65.2%
  4. TriviaQA66.3%
  5. HumanEval48.1%
  6. MBPP63.8%
  7. GSM8K77.4%
  8. MATH45.3%
  9. C-Eval77.2%
  10. CMMLU78.2%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

  • Strong quality for its tiny 2.24B active-parameter footprint, beating Llama3.2-3B and DeepSeek-v2-Lite across most benchmarks
  • Trained with the Muon optimizer, demonstrating roughly 2x the sample efficiency of Adam and about 52% of the training FLOPs for comparable results
  • Open weights under the permissive MIT license, for both base and instruct variants
  • Drop-in compatible with vLLM and SGLang thanks to the shared DeepSeek-V3 architecture
  • Cheap to serve: only 2.24B parameters are active per token despite 15.29B total

Best for

  • Research into alternative optimizers (Muon vs AdamW) and efficient MoE pretraining
  • Lightweight self-hosted chat and instruction-following where a small active footprint matters
  • On-prem code generation and math reasoning at low inference cost
  • English and Chinese language tasks (it was tuned on both)
  • A reference baseline for benchmarking new small MoE models

How to access

ProviderModel ID
Hugging Face (self-host / weights) ↗moonshotai/Moonlight-16B-A3B-Instruct
OpenRouter (free endpoint) ↗moonshotai/moonlight-16b-a3b-instruct

FAQ

What is Moonlight-16B-A3B?

It is an open-weight Mixture-of-Experts language model from Moonshot AI, released in February 2025. It has 15.29B total parameters but activates only 2.24B per token (hence the "16B/3B" / "A3B" naming), and its main claim to fame is being trained with the Muon optimizer instead of AdamW.

What makes Moonlight special compared to other small models?

It was used to prove that the Muon optimizer scales to real LLM training. Moonshot reports roughly 2x the sample efficiency of Adam, reaching comparable quality at about 52% of the training FLOPs. With only 2.24B active parameters it still beats Llama3.2-3B and DeepSeek-v2-Lite on most English, code, math, and Chinese benchmarks.

Is Moonlight-16B-A3B free and open source?

Yes. Both the base and Instruct variants are released under the MIT license on Hugging Face, so you can download, fine-tune, and self-host them freely. Because it shares the DeepSeek-V3 architecture, it runs on vLLM and SGLang.

What is its context window?

Moonlight-16B-A3B supports an 8K-token (8,192) context window and is a text-only model.