AI/TLDR

Kimi-VL-A3B-Thinking

Moonshot AI's open MoE vision-language reasoner that activates just 2.8B parameters

Overview

Kimi-VL-A3B-Thinking is the long-reasoning variant of Moonshot AI's Kimi-VL family, an open-weight Mixture-of-Experts vision-language model released in April 2025. Despite a roughly 16-billion-parameter footprint, it activates only about 2.8B parameters in its language decoder per step, pairing that efficient MoE backbone with the native-resolution MoonViT visual encoder and an MLP projector. The result is a small model that reads images, multi-image sets, video frames and long documents while keeping inference cheap.

Where the base Kimi-VL-A3B-Instruct focuses on perception, Kimi-VL-A3B-Thinking is tuned for step-by-step multimodal reasoning. The Kimi Team built it through long chain-of-thought supervised fine-tuning followed by reinforcement learning, so it spends extra tokens working through math diagrams, charts and visual logic before answering. On its release the model scored 61.7 on MMMU (val), 36.8 on MathVision and 71.3 on MathVista-mini at the compact 2.8B activated size, competitive with much larger vision models of its era.

Kimi-VL-A3B-Thinking ships under the permissive MIT license with full open weights on Hugging Face, a 128K-token context window, and a Hugging Face Spaces demo. Because it is genuinely open, it runs locally or on hosted endpoints such as Replicate, making it a practical pick for builders who need an inexpensive, auditable visual reasoner rather than a closed API. Note that Moonshot later shipped an upgraded Kimi-VL-A3B-Thinking-2506 checkpoint; this page covers the original April 2025 release.

Released2025-04-10
LicenseMIT
WeightsOpen weights
Parameters16B total / ~2.8B activated (A3B MoE)
Context128K
ArchitectureMixture-of-Experts (MoE) language decoder based on Moonshot's Moonlight, paired with the native-resolution MoonViT visual encoder and an MLP projector. About 16B total parameters with only ~2.8B activated per step. Long chain-of-thought variant trained via CoT supervised fine-tuning plus reinforcement learning; 128K context.
Knowledge cutoffNot disclosed
ModalitiesText, Vision, Video
Statusavailable

Benchmarks

  1. MMMU (val, Pass@1)61.7%
  2. MathVista-mini (Pass@1)71.3%
  3. MathVision (full, Pass@1)36.8%
  4. InfoVQA83.2%
  5. LongVideoBench64.5%
  6. MMLongBench-Doc35.1%
  7. ScreenSpot-Pro34.5%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

  • Strong multimodal chain-of-thought reasoning at a tiny ~2.8B activated-parameter cost
  • Fully open weights under the permissive MIT license — local or self-hosted deployment
  • Native-resolution MoonViT encoder handles high-resolution images, charts and documents
  • 128K-token context for long documents, multi-image inputs and video frames
  • Efficient MoE design keeps inference cost and latency low versus dense vision models

Best for

  • Visual math and diagram reasoning (geometry, charts, plots)
  • Document, OCR and long-PDF understanding over long context
  • Multi-image and video-frame question answering
  • On-device or self-hosted multimodal apps where closed APIs aren't an option
  • Research and fine-tuning baselines for efficient MoE vision-language models

How to access

ProviderModel ID
Hugging Face (weights) ↗moonshotai/Kimi-VL-A3B-Thinking
Replicate ↗zsxkib/kimi-vl-a3b-thinking

Kimi-VL — every version

The full lineage of the Kimi-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
Kimi-VL-A3B-Thinking-2506current2025-06-21Open weights
Kimi-VL-A3B-Thinking2025-04Open weights
Kimi-VL-A3B-Instruct2025-04Open weights

FAQ

Is Kimi-VL-A3B-Thinking open source?

Yes. Moonshot AI released the weights under the permissive MIT license on Hugging Face, so you can download, self-host, fine-tune and use it commercially. A hosted demo is also available on Hugging Face Spaces.

How many parameters does Kimi-VL-A3B-Thinking have?

It is a Mixture-of-Experts model with roughly 16B total parameters but only about 2.8B activated per step in its language decoder (the "A3B" naming). That makes it far cheaper to run than a dense model of similar capability.

What is the difference between Kimi-VL-A3B-Thinking and Kimi-VL-A3B-Instruct?

Instruct is tuned for general visual perception, while Thinking is a long chain-of-thought variant trained with CoT supervised fine-tuning plus reinforcement learning. Thinking spends extra reasoning tokens, which boosts math, chart and visual-logic accuracy.

What context window and inputs does it support?

It supports a 128K-token context and accepts text plus images (including multiple images), high-resolution inputs via the MoonViT encoder, video frames and long documents. Moonshot later shipped an improved Kimi-VL-A3B-Thinking-2506 checkpoint.