Kimi-VL-A3B-Instruct

Moonshot AI's lightweight open-weights MoE vision-language model — 16B total, 2.8B active, 128K context

Overview

Kimi-VL-A3B-Instruct is the instruction-tuned vision-language model in Moonshot AI's (Kimi) Kimi-VL line, released in April 2025 under an MIT license with fully open weights. It pairs a native-resolution visual encoder called MoonViT with a sparse Mixture-of-Experts language decoder: although the model totals 16B parameters, it activates only about 2.8B per token by routing through 8 of 384 experts, giving it the efficiency of a roughly 3B-class dense model while retaining the capacity of a much larger one.

The model handles text, single and multiple images, video sequences, and long documents within a 128K-token context window, making it well suited to OCR, college-level image and video comprehension, mathematical reasoning, and multi-image understanding. Moonshot AI positions Kimi-VL-A3B-Instruct as a compact, open alternative that competes with efficient closed and open VLMs such as GPT-4o-mini, Qwen2.5-VL-7B and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains like document and OCR tasks.

Kimi-VL-A3B-Instruct is distributed primarily as open weights on Hugging Face for self-hosting rather than as a paid hosted API. A separate reasoning-focused sibling, Kimi-VL-A3B-Thinking, adds chain-of-thought and reinforcement-learning training on top of the same A3B backbone.

Released	2025-04
License	MIT
Weights	Open weights
Parameters	16B total / 2.8B activated (MoE, 8 of 384 experts)
Context	128K
Architecture	Mixture-of-Experts (MoE) vision-language model. A native-resolution visual encoder (MoonViT) feeds an MLP projector into a sparse MoE language decoder based on the Moonlight backbone: 16B total parameters with only ~2.8B activated per token, routing 8 of 384 experts. Supports a 128K-token context for multi-image, long-document and video inputs.
Modalities	Text, Vision, Video, PDF
Status	available

Benchmarks

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

Strong document understanding and OCR for its size — 86.7% on OCRBench and 83.2 on InfoVQA
Efficient MoE design: ~2.8B active parameters keep inference cheap while 16B total capacity preserves quality
128K context window enables long documents, multi-image sets and video clips in a single prompt
Capable GUI/agent grounding — 92.8 on ScreenSpot-V2 for on-screen element localization
Fully open weights under a permissive MIT license, so it can be self-hosted and fine-tuned freely
Solid math and college-level multimodal reasoning (68.7 MathVista, 57.0 MMMU) despite its small active footprint

Best for

OCR and document parsing across PDFs, scans and screenshots
Visual question answering over charts, diagrams and infographics
Long-document and multi-image analysis using the 128K context
Video comprehension and summarization
GUI agents and on-screen element grounding for automation
Self-hosted multimodal applications where open weights and low inference cost matter

Kimi-VL — every version

The full lineage of the Kimi-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

Version	Released	Context	License
Kimi-VL-A3B-Thinking-2506current	2025-06-21	—	Open weights
Kimi-VL-A3B-Thinking	2025-04	—	Open weights
Kimi-VL-A3B-Instruct	2025-04	—	Open weights

FAQ

What is Kimi-VL-A3B-Instruct?

It is Moonshot AI's (Kimi) instruction-tuned vision-language model, released in April 2025. It uses a Mixture-of-Experts design with 16B total parameters but only about 2.8B activated per token, pairing a MoonViT visual encoder with a sparse MoE language decoder for text, image, video and document understanding.

Is Kimi-VL-A3B-Instruct open source?

Yes. The weights are openly published on Hugging Face under the permissive MIT license, so it can be self-hosted, fine-tuned and used commercially. There is no official paid hosted API for the Instruct variant — it is designed to run on your own hardware.

What context window does Kimi-VL-A3B-Instruct support?

It supports a 128K-token context window, large enough to handle long documents, multiple high-resolution images, or lengthy video clips alongside extensive text in a single prompt.

How does Kimi-VL-A3B-Instruct compare to other models?

Despite activating only about 2.8B parameters, it competes with efficient VLMs like GPT-4o-mini, Qwen2.5-VL-7B and Gemma-3-12B-IT, and surpasses GPT-4o in some specialized areas such as OCR (86.7 on OCRBench) and document understanding.

// Overview

// Benchmarks

// Strengths

// Best for

// Kimi-VL — every version

// FAQ

Overview

Benchmarks

Strengths

Best for

Kimi-VL — every version

FAQ