AI/TLDR

Kimi-VL-A3B-Instruct

Moonshot AI's lightweight open-weights MoE vision-language model — 16B total, 2.8B active, 128K context

Overview

Kimi-VL-A3B-Instruct is the instruction-tuned vision-language model in Moonshot AI's (Kimi) Kimi-VL line, released in April 2025 under an MIT license with fully open weights. It pairs a native-resolution visual encoder called MoonViT with a sparse Mixture-of-Experts language decoder: although the model totals 16B parameters, it activates only about 2.8B per token by routing through 8 of 384 experts, giving it the efficiency of a roughly 3B-class dense model while retaining the capacity of a much larger one.

The model handles text, single and multiple images, video sequences, and long documents within a 128K-token context window, making it well suited to OCR, college-level image and video comprehension, mathematical reasoning, and multi-image understanding. Moonshot AI positions Kimi-VL-A3B-Instruct as a compact, open alternative that competes with efficient closed and open VLMs such as GPT-4o-mini, Qwen2.5-VL-7B and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains like document and OCR tasks.

Kimi-VL-A3B-Instruct is distributed primarily as open weights on Hugging Face for self-hosting rather than as a paid hosted API. A separate reasoning-focused sibling, Kimi-VL-A3B-Thinking, adds chain-of-thought and reinforcement-learning training on top of the same A3B backbone.

Released2025-04
LicenseMIT
WeightsOpen weights
Parameters16B total / 2.8B activated (MoE, 8 of 384 experts)
Context128K
ArchitectureMixture-of-Experts (MoE) vision-language model. A native-resolution visual encoder (MoonViT) feeds an MLP projector into a sparse MoE language decoder based on the Moonlight backbone: 16B total parameters with only ~2.8B activated per token, routing 8 of 384 experts. Supports a 128K-token context for multi-image, long-document and video inputs.
ModalitiesText, Vision, Video, PDF
Statusavailable

Benchmarks

  1. MMMU (Val)57%
  2. MathVista68.7%
  3. MMBench-EN-v1.183.1%
  4. AI2D84.9%
  5. InfoVQA83.2%
  6. OCRBench86.7% (867/1000)
  7. ScreenSpot-V292.8%
  8. ScreenSpot-Pro34.5%
  9. LongVideoBench64.5%
  10. MMLongBench-Doc35.1%
  11. VideoMMMU52.6%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Strengths

  • Strong document understanding and OCR for its size — 86.7% on OCRBench and 83.2 on InfoVQA
  • Efficient MoE design: ~2.8B active parameters keep inference cheap while 16B total capacity preserves quality
  • 128K context window enables long documents, multi-image sets and video clips in a single prompt
  • Capable GUI/agent grounding — 92.8 on ScreenSpot-V2 for on-screen element localization
  • Fully open weights under a permissive MIT license, so it can be self-hosted and fine-tuned freely
  • Solid math and college-level multimodal reasoning (68.7 MathVista, 57.0 MMMU) despite its small active footprint

Best for

  • OCR and document parsing across PDFs, scans and screenshots
  • Visual question answering over charts, diagrams and infographics
  • Long-document and multi-image analysis using the 128K context
  • Video comprehension and summarization
  • GUI agents and on-screen element grounding for automation
  • Self-hosted multimodal applications where open weights and low inference cost matter

Kimi-VL — every version

The full lineage of the Kimi-VL line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
Kimi-VL-A3B-Thinking-2506current2025-06-21Open weights
Kimi-VL-A3B-Thinking2025-04Open weights
Kimi-VL-A3B-Instruct2025-04Open weights

FAQ

What is Kimi-VL-A3B-Instruct?

It is Moonshot AI's (Kimi) instruction-tuned vision-language model, released in April 2025. It uses a Mixture-of-Experts design with 16B total parameters but only about 2.8B activated per token, pairing a MoonViT visual encoder with a sparse MoE language decoder for text, image, video and document understanding.

Is Kimi-VL-A3B-Instruct open source?

Yes. The weights are openly published on Hugging Face under the permissive MIT license, so it can be self-hosted, fine-tuned and used commercially. There is no official paid hosted API for the Instruct variant — it is designed to run on your own hardware.

What context window does Kimi-VL-A3B-Instruct support?

It supports a 128K-token context window, large enough to handle long documents, multiple high-resolution images, or lengthy video clips alongside extensive text in a single prompt.

How does Kimi-VL-A3B-Instruct compare to other models?

Despite activating only about 2.8B parameters, it competes with efficient VLMs like GPT-4o-mini, Qwen2.5-VL-7B and Gemma-3-12B-IT, and surpasses GPT-4o in some specialized areas such as OCR (86.7 on OCRBench) and document understanding.