AI/TLDR

DeepSeek-VL2

Open-weight Mixture-of-Experts vision-language models (Tiny/Small/Base) for OCR, document, table and chart understanding — strong accuracy at very few activated parameters.

Overview

DeepSeek-VL2 is an open-weight series of Mixture-of-Experts (MoE) vision-language models from DeepSeek, released on 13 December 2024 alongside its technical report (arXiv:2412.10302). It is the second generation of the DeepSeek VL line and a substantial upgrade over the original dense DeepSeek-VL: the language backbone is now the sparse DeepSeekMoE architecture with Multi-head Latent Attention, and a dynamic tiling vision encoder lets it read high-resolution images of any aspect ratio. The technical report positions the models as competitive with or better than open-source models of similar or smaller activated size, while activating only a small fraction of their weights per token.

The series ships in three sizes so you can match the model to your compute budget: DeepSeek-VL2-Tiny (3B total / 1.0B activated), DeepSeek-VL2-Small (16B total / 2.8B activated), and the full DeepSeek-VL2 (27B total / 4.5B activated). Each is built on the matching DeepSeekMoE base and uses a single SigLIP-SO400M-384 encoder. Because only the experts selected for a given token run, even the 27B flagship has the per-token compute of a roughly 4.5B-parameter dense model, although all 27B weights must still be held in memory. The Tiny variant, at 3B total parameters, is the easiest of the three to fit on a modest GPU.
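The total and activated counts above translate directly into an activated share and a rough weight footprint. The sketch below is only back-of-the-envelope arithmetic on the rounded figures quoted on this page; the 2 bytes per parameter is an assumption (a 16-bit weight format), and it ignores activations, the vision encoder's working memory and the KV cache.

```python
# (total, activated) parameter counts in billions, as listed for each variant.
VARIANTS = {
    "DeepSeek-VL2-Tiny": (3.0, 1.0),
    "DeepSeek-VL2-Small": (16.0, 2.8),
    "DeepSeek-VL2": (27.0, 4.5),
}

BYTES_PER_PARAM = 2  # assumption: weights stored in a 16-bit format


def activated_fraction(total_b, activated_b):
    """Share of the weights that take part in computing one token."""
    return activated_b / total_b


def weight_memory_gb(total_b, bytes_per_param=BYTES_PER_PARAM):
    """Rough memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bytes per parameter = gigabytes
    return total_b * bytes_per_param


for name, (total_b, activated_b) in VARIANTS.items():
    share = activated_fraction(total_b, activated_b)
    memory = weight_memory_gb(total_b)
    print(f"{name}: {share:.1%} of weights activated, ~{memory:.0f} GB of weights")
```

Under these assumptions the activated share is about one third for Tiny, 17.5% for Small and about one sixth for the flagship, while the weights alone come to roughly 6, 32 and 54 GB. Compute scales with the activated count; memory scales with the total.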

DeepSeek-VL2 targets practical multimodal understanding rather than open-ended chat: visual question answering, optical character recognition, document, table and chart reading, and visual grounding with bounding boxes. The weights are published on Hugging Face under the DeepSeek Model License (commercial use permitted), with the code released under MIT, so the models can be self-hosted or run through third-party services such as Replicate. As an open-weight download, DeepSeek-VL2 is not served on DeepSeek's own per-token API.

Released: 2024-12-13
License: DeepSeek Model License (commercial use permitted) · code MIT
Weights: Open weights
Parameters: Three MoE variants. Tiny: 3B total / 1.0B activated · Small: 16B total / 2.8B activated · Base: 27B total / 4.5B activated
Context: 4K
Max output: Undisclosed
Architecture: Mixture-of-Experts vision-language model. A single SigLIP-SO400M-384 vision encoder with a dynamic tiling strategy splits high-resolution images into 384x384 tiles to handle arbitrary aspect ratios; the language side is a DeepSeekMoE backbone with Multi-head Latent Attention (MLA), which compresses the key-value cache into latent vectors for efficient inference. All variants train at a 4096-token sequence length.
Knowledge cutoff: Undisclosed (built on the DeepSeekMoE / DeepSeek-V2-era language backbone)
Modalities: Text, Vision
Status: Generally available
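The dynamic tiling idea in the Architecture entry can be sketched as a grid-selection problem: pick a layout of 384x384 tiles whose overall shape suits the input image, resize the image into it, then cut it into tiles. The code below is an illustration only. The cap of 9 tiles and the selection rule (keep the most real detail, then waste the fewest canvas pixels) are assumptions borrowed from common any-resolution pipelines, not a statement of what DeepSeek-VL2 itself does, and all function names are hypothetical.

```python
TILE = 384  # tile side in pixels, matching the 384x384 tiles described above


def candidate_grids(max_tiles=9):
    """Every (cols, rows) grid using at most max_tiles tiles (the cap is an assumption)."""
    return [(cols, rows)
            for cols in range(1, max_tiles + 1)
            for rows in range(1, max_tiles + 1)
            if cols * rows <= max_tiles]


def fit_stats(width, height, cols, rows):
    """Fit the image into a cols x rows canvas, keeping its aspect ratio.

    Returns (effective, wasted): pixels of real detail kept, and canvas
    pixels that hold padding or upscaled content.
    """
    canvas_w, canvas_h = cols * TILE, rows * TILE
    if canvas_w * height <= canvas_h * width:   # width is the binding side
        new_w, new_h = canvas_w, height * canvas_w // width
    else:                                       # height is the binding side
        new_w, new_h = width * canvas_h // height, canvas_h
    effective = min(new_w * new_h, width * height)
    wasted = canvas_w * canvas_h - effective
    return effective, wasted


def choose_grid(width, height, max_tiles=9):
    """Assumed rule: maximise retained detail, then minimise wasted canvas."""
    def score(grid):
        effective, wasted = fit_stats(width, height, *grid)
        return (-effective, wasted)
    return min(candidate_grids(max_tiles), key=score)


def tile_boxes(cols, rows):
    """(left, top, right, bottom) crop boxes on the canvas, in row-major order."""
    return [(c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
            for r in range(rows) for c in range(cols)]


cols, rows = choose_grid(1920, 1080)
print(cols, rows, len(tile_boxes(cols, rows)))
```

With these assumptions a 1920x1080 screenshot lands on a 4x2 grid (eight tiles), while a small 300x200 image stays on a single tile, which is the behaviour that lets wide or tall documents keep their detail without being squashed into one square.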

Benchmarks

  1. DocVQA (test): 93.3%
  2. ChartQA (test): 86.0%
  3. TextVQA: 84.2%
  4. InfoVQA (val): 78.1%
  5. OCRBench: 811
  6. AI2D (test): 81.4%
  7. MMStar: 61.3%
  8. MathVista (testmini): 62.8%

Scores are percentages on a 0–100 scale, except OCRBench, which is scored out of 1000; higher is better.

Strengths

  • Strong document and OCR understanding — 93.3 on DocVQA and an 811 OCRBench score for the full model, competitive with much larger systems
  • Excellent chart and table reading (ChartQA 86.0, InfoVQA 78.1) for data-heavy images
  • MoE efficiency: the 27B flagship activates only ~4.5B parameters per token, so per-token compute tracks a small dense model (the full 27B weights still need to be held in memory)
  • Dynamic tiling vision encoder handles high-resolution images and extreme aspect ratios without cropping away detail
  • Three sizes (1.0B / 2.8B / 4.5B activated) let you trade accuracy for footprint, down to the Tiny variant for modest GPUs
  • Fully open weights under a license that permits commercial use, with MIT-licensed inference code
  • Visual grounding with bounding-box output, plus multi-image input support

Best for

  • Document AI: parsing scanned PDFs, forms, receipts and tables into structured answers
  • OCR and text extraction from photos, screenshots and dense documents
  • Chart, plot and infographic question answering for analytics workflows
  • General visual question answering over single or multiple images
  • Visual grounding — locating and bounding objects described in a prompt
  • Self-hosted multimodal deployments where open weights and commercial licensing are required
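For the visual grounding use case, the model's text output has to be turned back into pixel boxes. The sketch below assumes the answer wraps each object in `<|ref|>…<|/ref|>` followed by `<|det|>[[x1, y1, x2, y2]]<|/det|>`, with coordinates on a 0–999 grid. Both the tag layout and the grid size are assumptions to verify against the model card before use; `parse_grounding` is a hypothetical helper, not part of any DeepSeek package.

```python
import json
import re

# Assumed tag layout for grounded answers; check the model card before relying on it.
GROUNDING_PATTERN = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL
)


def parse_grounding(text, image_width, image_height, grid=999):
    """Return [(label, (left, top, right, bottom))] in pixel coordinates.

    `grid` is the assumed maximum normalised coordinate value.
    """
    results = []
    for label, raw_boxes in GROUNDING_PATTERN.findall(text):
        for x1, y1, x2, y2 in json.loads(raw_boxes):
            box = (
                round(x1 * image_width / grid),
                round(y1 * image_height / grid),
                round(x2 * image_width / grid),
                round(y2 * image_height / grid),
            )
            results.append((label.strip(), box))
    return results


# Hand-written sample in the assumed format, not real model output.
sample = "<|ref|>the invoice total<|/ref|><|det|>[[0, 333, 999, 666]]<|/det|>"
print(parse_grounding(sample, image_width=1998, image_height=999))
```

Keeping the rescaling in one place makes it easy to adjust if the actual coordinate convention differs from the one assumed here.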

How to access

Provider | Model ID
Replicate | deepseek-ai/deepseek-vl2
Hugging Face (download weights) | deepseek-ai/deepseek-vl2
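For self-hosting, the weights can be fetched with the `huggingface_hub` client. This is a minimal sketch: `snapshot_download` is the library's standard call for pulling a whole repository, the flagship repo ID is the one listed above, and the `-tiny` and `-small` IDs are assumptions based on the naming pattern, so confirm them on Hugging Face before use. Running inference afterwards needs DeepSeek's own model code, which is not shown here.

```python
# Repo IDs on Hugging Face. Only the flagship ID appears on this page;
# the "-tiny" and "-small" suffixes are assumed from the naming pattern.
REPO_IDS = {
    "tiny": "deepseek-ai/deepseek-vl2-tiny",
    "small": "deepseek-ai/deepseek-vl2-small",
    "base": "deepseek-ai/deepseek-vl2",
}


def repo_id(variant="base"):
    """Map a variant name to its (partly assumed) Hugging Face repo ID."""
    try:
        return REPO_IDS[variant]
    except KeyError:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(REPO_IDS)}")


def download_weights(variant="base", local_dir=None):
    """Download every file in the repo and return the local folder path."""
    # Third-party dependency, imported lazily: pip install huggingface_hub
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id(variant), local_dir=local_dir)


print(repo_id("base"))
```

Calling `download_weights("tiny", local_dir="./deepseek-vl2-tiny")` would be a reasonable first step on a small machine, since the flagship download is many times larger.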

DeepSeek VL — every version

The full lineage of the DeepSeek VL line, newest first.

Version | Released | Context | License
DeepSeek-VL2 (current) | 2024-12-13 | - | Open weights
DeepSeek-VL | 2024-03 | - | Open weights

FAQ

What is DeepSeek-VL2 and how is it different from DeepSeek-VL?

DeepSeek-VL2 is DeepSeek's second-generation, open-weight vision-language model series, released in December 2024. The main change from the original DeepSeek-VL is the architecture: the language backbone moved from a dense transformer to a sparse Mixture-of-Experts (DeepSeekMoE with Multi-head Latent Attention), and the vision side adopted a dynamic tiling encoder for high-resolution images. This gives stronger OCR, document and chart understanding while activating only a small fraction of the weights per token.
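The dense-versus-sparse difference is easiest to see in a toy router. In the sketch below, a gate scores every expert for a token but only the top-k experts run, and their outputs are mixed by renormalised gate weights. This is a generic top-k Mixture-of-Experts illustration with made-up experts and scores, not the DeepSeekMoE router itself, whose expert layout and weighting details are not described on this page.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_scores, top_k):
    """Return [(expert_index, weight)] for the top_k experts; weights sum to 1."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, experts, gate_scores, top_k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(token)
    for index, weight in route_token(gate_scores, top_k):
        expert_out = experts[index](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


def make_expert(scale):
    """Hypothetical stand-in for an expert network: scales its input."""
    return lambda vector: [scale * v for v in vector]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_scores = [0.1, 2.0, -1.0, 1.5]   # illustrative gate logits for one token

print(route_token(gate_scores, top_k=2))
print(moe_layer([1.0, -1.0], experts, gate_scores, top_k=2))
```

Here four experts exist but only two are evaluated for the token, which is the same principle that lets a 27B-parameter model spend roughly 4.5B parameters of compute per token.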

What sizes does DeepSeek-VL2 come in?

Three. DeepSeek-VL2-Tiny has 3B total parameters with 1.0B activated, DeepSeek-VL2-Small has 16B total with 2.8B activated, and the full DeepSeek-VL2 has 27B total with 4.5B activated. All three use the same SigLIP-SO400M-384 vision encoder and a matching DeepSeekMoE language backbone; the larger variants are more accurate, while Tiny fits on modest GPUs.

Is DeepSeek-VL2 open source and free to use commercially?

The weights are published on Hugging Face under the DeepSeek Model License, which permits commercial use, and the inference code is released under the MIT license. You can download and self-host all three variants, or run them through third-party services such as Replicate. There is no per-token DeepSeek API for VL2 — it is distributed as a downloadable open-weight model.

What is DeepSeek-VL2 best at?

Document and visual understanding. In the technical report the full model scores 93.3 on DocVQA, 86.0 on ChartQA, 84.2 on TextVQA, 78.1 on InfoVQA and 811 on OCRBench, making it especially strong for OCR, document, table and chart reading, alongside general visual question answering and visual grounding with bounding boxes.