Overview
Kimi-VL-A3B-Instruct is the instruction-tuned vision-language model in Moonshot AI's (Kimi) Kimi-VL line, released in April 2025 under an MIT license with fully open weights. It pairs a native-resolution visual encoder called MoonViT with a sparse Mixture-of-Experts language decoder: although the model totals 16B parameters, it activates only about 2.8B per token by routing through 8 of 384 experts, giving it the efficiency of a roughly 3B-class dense model while retaining the capacity of a much larger one.
The model handles text, single and multiple images, video sequences, and long documents within a 128K-token context window, making it well suited to OCR, college-level image and video comprehension, mathematical reasoning, and multi-image understanding. Moonshot AI positions Kimi-VL-A3B-Instruct as a compact, open alternative that competes with efficient closed and open VLMs such as GPT-4o-mini, Qwen2.5-VL-7B and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains like document and OCR tasks.
Kimi-VL-A3B-Instruct is distributed primarily as open weights on Hugging Face for self-hosting rather than as a paid hosted API. A separate reasoning-focused sibling, Kimi-VL-A3B-Thinking, adds chain-of-thought and reinforcement-learning training on top of the same A3B backbone.
| Released | 2025-04 |
|---|---|
| License | MIT |
| Weights | Open weights |
| Parameters | 16B total / 2.8B activated (MoE, 8 of 384 experts) |
| Context | 128K |
| Architecture | Mixture-of-Experts (MoE) vision-language model. A native-resolution visual encoder (MoonViT) feeds an MLP projector into a sparse MoE language decoder based on the Moonlight backbone: 16B total parameters with only ~2.8B activated per token, routing 8 of 384 experts. Supports a 128K-token context for multi-image, long-document and video inputs. |
| Modalities | Text, Vision, Video, PDF |
| Status | available |
Benchmarks
- MMMU (Val)57%
- MathVista68.7%
- MMBench-EN-v1.183.1%
- AI2D84.9%
- InfoVQA83.2%
- OCRBench86.7% (867/1000)
- ScreenSpot-V292.8%
- ScreenSpot-Pro34.5%
- LongVideoBench64.5%
- MMLongBench-Doc35.1%
- VideoMMMU52.6%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Strengths
- Strong document understanding and OCR for its size — 86.7% on OCRBench and 83.2 on InfoVQA
- Efficient MoE design: ~2.8B active parameters keep inference cheap while 16B total capacity preserves quality
- 128K context window enables long documents, multi-image sets and video clips in a single prompt
- Capable GUI/agent grounding — 92.8 on ScreenSpot-V2 for on-screen element localization
- Fully open weights under a permissive MIT license, so it can be self-hosted and fine-tuned freely
- Solid math and college-level multimodal reasoning (68.7 MathVista, 57.0 MMMU) despite its small active footprint
Best for
- OCR and document parsing across PDFs, scans and screenshots
- Visual question answering over charts, diagrams and infographics
- Long-document and multi-image analysis using the 128K context
- Video comprehension and summarization
- GUI agents and on-screen element grounding for automation
- Self-hosted multimodal applications where open weights and low inference cost matter
Kimi-VL — every version
The full lineage of the Kimi-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Kimi-VL-A3B-Thinking-2506current | 2025-06-21 | — | Open weights |
| Kimi-VL-A3B-Thinking | 2025-04 | — | Open weights |
| Kimi-VL-A3B-Instruct | 2025-04 | — | Open weights |
FAQ
What is Kimi-VL-A3B-Instruct?
It is Moonshot AI's (Kimi) instruction-tuned vision-language model, released in April 2025. It uses a Mixture-of-Experts design with 16B total parameters but only about 2.8B activated per token, pairing a MoonViT visual encoder with a sparse MoE language decoder for text, image, video and document understanding.
Is Kimi-VL-A3B-Instruct open source?
Yes. The weights are openly published on Hugging Face under the permissive MIT license, so it can be self-hosted, fine-tuned and used commercially. There is no official paid hosted API for the Instruct variant — it is designed to run on your own hardware.
What context window does Kimi-VL-A3B-Instruct support?
It supports a 128K-token context window, large enough to handle long documents, multiple high-resolution images, or lengthy video clips alongside extensive text in a single prompt.
How does Kimi-VL-A3B-Instruct compare to other models?
Despite activating only about 2.8B parameters, it competes with efficient VLMs like GPT-4o-mini, Qwen2.5-VL-7B and Gemma-3-12B-IT, and surpasses GPT-4o in some specialized areas such as OCR (86.7 on OCRBench) and document understanding.
