Overview
Kimi-VL-A3B-Thinking-2506 is the current flagship of Moonshot AI's open-source Kimi-VL line, released on 21 June 2025. It is a Mixture-of-Experts vision-language model that carries 16B total parameters but activates only about 2.8B per token (the 'A3B' in the name), pairing the MoonViT native-resolution vision encoder with a Moonlight-16B-A3B MoE language decoder. The whole model is released under the permissive MIT license, so the weights can be downloaded, fine-tuned and self-hosted freely.
Unlike the original Kimi-VL-A3B-Thinking, the 2506 revision is a single model that is strong at both step-by-step reasoning and plain visual perception. Moonshot reports gains across multimodal-reasoning benchmarks (for example +20.1 points on MathVision and +8.4 on MathVista versus the first release) while using roughly 20% shorter thinking traces on average, and it matches the non-thinking Kimi-VL-A3B-Instruct on general perception tasks like MMBench, MMStar and RealWorldQA.
The model reads text, images, video and multi-page PDFs, supports a 128K-token context window, and handles high-resolution inputs of up to 3.2 million pixels (1792x1792) per image with up to 256 images per prompt. It can emit up to 32K output tokens, wrapping its reasoning in think tags. That combination makes it well suited to long-document understanding, video question answering, and GUI/OS-agent grounding tasks while staying cheap to run thanks to the sparse MoE design.
| Released | 2025-06-21 |
|---|---|
| License | MIT |
| Weights | Open weights |
| Parameters | 16B total / 2.8B active (MoE) |
| Context | 128K |
| Max output | 32K |
| Architecture | Mixture-of-Experts vision-language model. A native-resolution vision encoder (MoonViT) feeds an MoE language decoder based on Moonlight-16B-A3B: 16B total parameters with only ~2.8B activated per token. The 2506 update adds higher-resolution image support (up to 3.2M pixels / 1792x1792 per image, 256 images per prompt) and is tuned to reach answers with about 20% shorter chain-of-thought than the original Kimi-VL-A3B-Thinking. |
| Knowledge cutoff | December 2024 |
| Modalities | Text, Vision, Video, PDF |
| Status | Available |
Benchmarks
- MathVision56.9%
- MathVista (mini)80.1%
- MMMU (val)64%
- MMMU-Pro46.3%
- MMBench-EN-v1.184.4%
- MMStar70.4%
- RealWorldQA70%
- MMVet78.1%
- OCRBench869%
- VideoMMMU65.2%
- Video-MME71.9%
- V* Benchmark83.2%
- ScreenSpot-Pro52.8%
- OSWorld-G52.5%
- MMLongBench-DOC42.1%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.025 / 1M tokens per 1M tokens |
|---|---|
| Output | $0.10 / 1M tokens per 1M tokens |
Moonshot AI does not list this model on its own platform; weights are MIT-licensed and self-hostable. The figures above are the paid OpenRouter endpoint; a free OpenRouter endpoint (kimi-vl-a3b-thinking:free) is also available.
Strengths
- Open weights under a permissive MIT license — free to download, fine-tune and self-host
- Efficient sparse MoE: 16B total parameters but only ~2.8B activated per token
- Strong multimodal math and reasoning (MathVision 56.9, MathVista 80.1) with ~20% shorter thinking traces than the prior release
- High-resolution vision via MoonViT — up to 3.2M pixels (1792x1792) per image, 256 images per prompt
- Open-source state of the art on VideoMMMU (65.2) for video reasoning
- Strong GUI / OS-agent grounding (ScreenSpot-Pro 52.8, OSWorld-G 52.5, V* 83.2)
- 128K-token context for long documents and multi-image or PDF inputs
Best for
- Visual and multimodal math/reasoning over charts, diagrams and screenshots
- Long-document and multi-page PDF understanding within a 128K context
- Video question answering and video reasoning
- GUI / OS-agent automation that grounds clicks on high-resolution UI screenshots
- OCR and high-resolution image analysis
- Self-hosted multimodal deployments where open weights and low activated-parameter cost matter
How to access
| Provider | Model ID |
|---|---|
| OpenRouter ↗ | moonshotai/kimi-vl-a3b-thinking |
| Hugging Face (weights) ↗ | moonshotai/Kimi-VL-A3B-Thinking-2506 |
Kimi-VL — every version
The full lineage of the Kimi-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Kimi-VL-A3B-Thinking-2506current | 2025-06-21 | — | Open weights |
| Kimi-VL-A3B-Thinking | 2025-04 | — | Open weights |
| Kimi-VL-A3B-Instruct | 2025-04 | — | Open weights |
FAQ
What is Kimi-VL-A3B-Thinking-2506?
It is the current flagship of Moonshot AI's open-source Kimi-VL line, released on 21 June 2025. It is a Mixture-of-Experts vision-language model with 16B total parameters but only about 2.8B activated per token, combining the MoonViT vision encoder with a Moonlight-16B-A3B language decoder for multimodal reasoning over text, images, video and PDFs.
Is Kimi-VL-A3B-Thinking-2506 open source?
Yes. The weights are released on Hugging Face under the permissive MIT license, so you can download, fine-tune and self-host the model freely.
What context length and resolution does it support?
It supports a 128K-token (131,072) context window and high-resolution images of up to 3.2 million pixels (1792x1792) per image, with up to 256 images per prompt, and can output up to 32K tokens.
How much does it cost to use via an API?
Moonshot does not list this model on its own platform, but it is served on OpenRouter at roughly $0.025 per million input tokens and $0.10 per million output tokens, with a free endpoint also available. Because the weights are MIT-licensed, you can also run it yourself at no per-token cost.
