Overview
Kimi-VL-A3B-Thinking is the long-reasoning variant of Moonshot AI's Kimi-VL family, an open-weight Mixture-of-Experts vision-language model released in April 2025. Despite a roughly 16-billion-parameter footprint, it activates only about 2.8B parameters in its language decoder per step, pairing that efficient MoE backbone with the native-resolution MoonViT visual encoder and an MLP projector. The result is a small model that reads images, multi-image sets, video frames and long documents while keeping inference cheap.
Where the base Kimi-VL-A3B-Instruct focuses on perception, Kimi-VL-A3B-Thinking is tuned for step-by-step multimodal reasoning. The Kimi Team built it through long chain-of-thought supervised fine-tuning followed by reinforcement learning, so it spends extra tokens working through math diagrams, charts and visual logic before answering. On its release the model scored 61.7 on MMMU (val), 36.8 on MathVision and 71.3 on MathVista-mini at the compact 2.8B activated size, competitive with much larger vision models of its era.
Kimi-VL-A3B-Thinking ships under the permissive MIT license with full open weights on Hugging Face, a 128K-token context window, and a Hugging Face Spaces demo. Because it is genuinely open, it runs locally or on hosted endpoints such as Replicate, making it a practical pick for builders who need an inexpensive, auditable visual reasoner rather than a closed API. Note that Moonshot later shipped an upgraded Kimi-VL-A3B-Thinking-2506 checkpoint; this page covers the original April 2025 release.
| Released | 2025-04-10 |
|---|---|
| License | MIT |
| Weights | Open weights |
| Parameters | 16B total / ~2.8B activated (A3B MoE) |
| Context | 128K |
| Architecture | Mixture-of-Experts (MoE) language decoder based on Moonshot's Moonlight, paired with the native-resolution MoonViT visual encoder and an MLP projector. About 16B total parameters with only ~2.8B activated per step. Long chain-of-thought variant trained via CoT supervised fine-tuning plus reinforcement learning; 128K context. |
| Knowledge cutoff | Not disclosed |
| Modalities | Text, Vision, Video |
| Status | available |
Benchmarks
- MMMU (val, Pass@1)61.7%
- MathVista-mini (Pass@1)71.3%
- MathVision (full, Pass@1)36.8%
- InfoVQA83.2%
- LongVideoBench64.5%
- MMLongBench-Doc35.1%
- ScreenSpot-Pro34.5%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Strengths
- Strong multimodal chain-of-thought reasoning at a tiny ~2.8B activated-parameter cost
- Fully open weights under the permissive MIT license — local or self-hosted deployment
- Native-resolution MoonViT encoder handles high-resolution images, charts and documents
- 128K-token context for long documents, multi-image inputs and video frames
- Efficient MoE design keeps inference cost and latency low versus dense vision models
Best for
- Visual math and diagram reasoning (geometry, charts, plots)
- Document, OCR and long-PDF understanding over long context
- Multi-image and video-frame question answering
- On-device or self-hosted multimodal apps where closed APIs aren't an option
- Research and fine-tuning baselines for efficient MoE vision-language models
How to access
| Provider | Model ID |
|---|---|
| Hugging Face (weights) ↗ | moonshotai/Kimi-VL-A3B-Thinking |
| Replicate ↗ | zsxkib/kimi-vl-a3b-thinking |
Kimi-VL — every version
The full lineage of the Kimi-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Kimi-VL-A3B-Thinking-2506current | 2025-06-21 | — | Open weights |
| Kimi-VL-A3B-Thinking | 2025-04 | — | Open weights |
| Kimi-VL-A3B-Instruct | 2025-04 | — | Open weights |
FAQ
Is Kimi-VL-A3B-Thinking open source?
Yes. Moonshot AI released the weights under the permissive MIT license on Hugging Face, so you can download, self-host, fine-tune and use it commercially. A hosted demo is also available on Hugging Face Spaces.
How many parameters does Kimi-VL-A3B-Thinking have?
It is a Mixture-of-Experts model with roughly 16B total parameters but only about 2.8B activated per step in its language decoder (the "A3B" naming). That makes it far cheaper to run than a dense model of similar capability.
What is the difference between Kimi-VL-A3B-Thinking and Kimi-VL-A3B-Instruct?
Instruct is tuned for general visual perception, while Thinking is a long chain-of-thought variant trained with CoT supervised fine-tuning plus reinforcement learning. Thinking spends extra reasoning tokens, which boosts math, chart and visual-logic accuracy.
What context window and inputs does it support?
It supports a 128K-token context and accepts text plus images (including multiple images), high-resolution inputs via the MoonViT encoder, video frames and long documents. Moonshot later shipped an improved Kimi-VL-A3B-Thinking-2506 checkpoint.
