AI/TLDR

Kimi Linear (48B-A3B)

Moonshot AI's open-weight hybrid linear-attention LLM built for fast 1M-token context

Overview

Kimi Linear (48B-A3B) is an open-weight large language model from Moonshot AI (the team behind Kimi). It is the flagship release of the Kimi Linear line, published on October 31, 2025 alongside the technical report "Kimi Linear: An Expressive, Efficient Attention Architecture." Moonshot ships it in two checkpoints on Hugging Face — Kimi-Linear-48B-A3B-Base and Kimi-Linear-48B-A3B-Instruct — both under the permissive MIT license.

The model is a Mixture-of-Experts design with 48B total parameters but only about 3B active per token (hence the "A3B" name), trained on 5.7T tokens. Its headline feature is the attention stack: instead of using full attention everywhere, Kimi Linear interleaves three layers of Kimi Delta Attention (KDA) — a linear-attention mechanism that refines Gated DeltaNet with finer gating — for every one layer of Multi-Head Latent Attention (MLA). This 3:1 hybrid is meant to be a drop-in replacement for full attention.

The payoff is long-context efficiency. Moonshot reports the architecture cuts KV-cache memory by up to 75% and delivers up to 6x faster decoding throughput at a 1M-token context, while matching or beating full attention on quality. That makes Kimi Linear (48B-A3B) most interesting to people who want to self-host a long-context model cheaply rather than to chase the absolute top of the reasoning leaderboards.

Released2025-10-31
LicenseMIT
WeightsOpen weights
Parameters48B total, 3B active (MoE)
Context1M
ArchitectureHybrid linear-attention Mixture-of-Experts. Stacks Kimi Delta Attention (KDA) — a refined Gated DeltaNet with fine-grained channel-wise gating — and full Multi-Head Latent Attention (MLA) layers in a 3:1 ratio (3 KDA layers per MLA layer). 48B total parameters with ~3B activated per token. Trained on 5.7T tokens.
Knowledge cutoffNot disclosed
ModalitiesText
StatusAvailable

Benchmarks

  1. MMLU-Pro (4k context)51%
  2. RULER (128k context)84.3%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

  • Native 1M-token context window with hardware-efficient KDA kernels
  • Up to 75% smaller KV cache versus standard full-attention models
  • Up to 6x faster decoding throughput at 1M-token context (6.3x faster time-per-output-token vs MLA per the paper)
  • Sparse MoE: only ~3B of 48B parameters active per token, keeping inference cost low
  • Fully open weights under the permissive MIT license (Base + Instruct checkpoints)
  • Open-sourced KDA kernels with vLLM and Flash Linear Attention (FLA) integration

Best for

  • Self-hosted long-context applications (large-document QA, multi-file code analysis) where KV-cache memory is the bottleneck
  • High-throughput batch and agent workloads that need fast decoding at long context
  • Cost-sensitive deployments that want an MoE model with low active-parameter count
  • Research and experimentation on linear / hybrid attention architectures, using the open KDA kernels
  • Fine-tuning the Base checkpoint for domain-specific tasks under a permissive MIT license

How to access

ProviderModel ID
Hugging Face (self-host / open weights) ↗moonshotai/Kimi-Linear-48B-A3B-Instruct
Featherless AI (serverless) ↗moonshotai/Kimi-Linear-48B-A3B-Instruct

FAQ

What is Kimi Linear (48B-A3B)?

It is an open-weight large language model from Moonshot AI, released on October 31, 2025. It uses a hybrid linear-attention architecture and is a Mixture-of-Experts model with 48B total parameters but only about 3B active per token. It supports a 1M-token context window and ships under the MIT license.

What makes Kimi Linear's architecture different?

Instead of full attention in every layer, it interleaves three layers of Kimi Delta Attention (KDA, a refined Gated DeltaNet) for every one Multi-Head Latent Attention (MLA) layer. Moonshot reports this 3:1 hybrid cuts KV-cache memory by up to 75% and gives up to 6x faster decoding at a 1M-token context, while matching or beating full attention on quality.

Is Kimi Linear (48B-A3B) free and open source?

Yes. Both the Base and Instruct checkpoints are released on Hugging Face under the MIT license, so you can download, run, and fine-tune them yourself. The KDA kernels are also open-sourced with vLLM and Flash Linear Attention (FLA) support.

How can I run Kimi Linear (48B-A3B)?

You can self-host the open weights from Hugging Face (vLLM with the FLA kernel is the recommended path), or use a third-party serverless provider such as Featherless AI. Moonshot's own hosted Kimi API currently lists its K2 models rather than Kimi Linear, and there is no published first-party per-token price for this model.