Kimi-VL-A3B-Thinking

Moonshot AI's open MoE vision-language reasoner that activates just 2.8B parameters

Overview

Kimi-VL-A3B-Thinking is the long-reasoning variant of Moonshot AI's Kimi-VL family, an open-weight Mixture-of-Experts vision-language model released in April 2025. Despite a roughly 16-billion-parameter footprint, it activates only about 2.8B parameters in its language decoder per step, pairing that efficient MoE backbone with the native-resolution MoonViT visual encoder and an MLP projector. The result is a small model that reads images, multi-image sets, video frames and long documents while keeping inference cheap.

Where the base Kimi-VL-A3B-Instruct focuses on perception, Kimi-VL-A3B-Thinking is tuned for step-by-step multimodal reasoning. The Kimi Team built it through long chain-of-thought supervised fine-tuning followed by reinforcement learning, so it spends extra tokens working through math diagrams, charts and visual logic before answering. On its release the model scored 61.7 on MMMU (val), 36.8 on MathVision and 71.3 on MathVista-mini at the compact 2.8B activated size, competitive with much larger vision models of its era.

Kimi-VL-A3B-Thinking ships under the permissive MIT license with full open weights on Hugging Face, a 128K-token context window, and a Hugging Face Spaces demo. Because it is genuinely open, it runs locally or on hosted endpoints such as Replicate, making it a practical pick for builders who need an inexpensive, auditable visual reasoner rather than a closed API. Note that Moonshot later shipped an upgraded Kimi-VL-A3B-Thinking-2506 checkpoint; this page covers the original April 2025 release.

Released	2025-04-10
License	MIT
Weights	Open weights
Parameters	16B total / ~2.8B activated (A3B MoE)
Context	128K
Architecture	Mixture-of-Experts (MoE) language decoder based on Moonshot's Moonlight, paired with the native-resolution MoonViT visual encoder and an MLP projector. About 16B total parameters with only ~2.8B activated per step. Long chain-of-thought variant trained via CoT supervised fine-tuning plus reinforcement learning; 128K context.
Knowledge cutoff	Not disclosed
Modalities	Text, Vision, Video
Status	available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

Strong multimodal chain-of-thought reasoning at a tiny ~2.8B activated-parameter cost
Fully open weights under the permissive MIT license — local or self-hosted deployment
Native-resolution MoonViT encoder handles high-resolution images, charts and documents
128K-token context for long documents, multi-image inputs and video frames
Efficient MoE design keeps inference cost and latency low versus dense vision models

Best for

Visual math and diagram reasoning (geometry, charts, plots)
Document, OCR and long-PDF understanding over long context
Multi-image and video-frame question answering
On-device or self-hosted multimodal apps where closed APIs aren't an option
Research and fine-tuning baselines for efficient MoE vision-language models

How to access

Provider	Model ID
Hugging Face (weights) ↗	`moonshotai/Kimi-VL-A3B-Thinking`
Replicate ↗	`zsxkib/kimi-vl-a3b-thinking`

Kimi-VL — every version

The full lineage of the Kimi-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
Kimi-VL-A3B-Thinking-2506current	2025-06-21	—	Open weights
Kimi-VL-A3B-Thinking	2025-04	—	Open weights
Kimi-VL-A3B-Instruct	2025-04	—	Open weights

FAQ

Is Kimi-VL-A3B-Thinking open source?

Yes. Moonshot AI released the weights under the permissive MIT license on Hugging Face, so you can download, self-host, fine-tune and use it commercially. A hosted demo is also available on Hugging Face Spaces.

How many parameters does Kimi-VL-A3B-Thinking have?

It is a Mixture-of-Experts model with roughly 16B total parameters but only about 2.8B activated per step in its language decoder (the "A3B" naming). That makes it far cheaper to run than a dense model of similar capability.

What is the difference between Kimi-VL-A3B-Thinking and Kimi-VL-A3B-Instruct?

Instruct is tuned for general visual perception, while Thinking is a long chain-of-thought variant trained with CoT supervised fine-tuning plus reinforcement learning. Thinking spends extra reasoning tokens, which boosts math, chart and visual-logic accuracy.

What context window and inputs does it support?

It supports a 128K-token context and accepts text plus images (including multiple images), high-resolution inputs via the MoonViT encoder, video frames and long documents. Moonshot later shipped an improved Kimi-VL-A3B-Thinking-2506 checkpoint.

// Overview

// Benchmarks

// Strengths

// Best for

// How to access

// Kimi-VL — every version

// FAQ