AI/TLDR

DeepSeek-VL

DeepSeek's first open-weight vision-language models — dense 1.3B and 7B with a hybrid SigLIP + SAM encoder for real-world image, document and chart understanding.

Overview

DeepSeek-VL is DeepSeek's first-generation, open-weight vision-language model series, released on 11 March 2024 alongside the technical report 'DeepSeek-VL: Towards Real-World Vision-Language Understanding' (arXiv:2403.05525). It comes in two dense sizes — DeepSeek-VL-1.3B and DeepSeek-VL-7B — each available as a Base model and an instruction-tuned Chat model. The design goal was practical, real-world multimodal understanding rather than benchmark chasing: the training data deliberately spans web screenshots, PDFs, OCR, charts and knowledge content, and the instruction-tuning set was built from a taxonomy of real user scenarios.

Architecturally, DeepSeek-VL pairs a dense DeepSeek-LLM language backbone (DeepSeek-LLM-1B for the 1.3B model, DeepSeek-LLM-7B for the 7B model) with a hybrid vision encoder. A SigLIP-L branch reads images at 384x384 for semantics while a SAM-B branch reads at 1024x1024 for fine detail; the two are fused into 576 visual tokens. This high-resolution path is what lets the model read documents, small text and dense charts without cropping away information, all at relatively low compute overhead. A key finding of the paper was that mixing language data into vision-language pretraining from the beginning preserves the model's text abilities — DeepSeek-VL-7B performs on par with the text-only DeepSeek-7B on language benchmarks.
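The 576-token figure can be checked with simple patch-grid arithmetic. The sketch below assumes both encoders use 16-pixel patches, which is the usual configuration for SigLIP-L/16 and SAM ViT-B but is not stated on this page; `PATCH` and `patch_grid` are illustrative names, not part of any DeepSeek code.

```python
# Shape arithmetic for the hybrid encoder.
# PATCH = 16 is an assumption for illustration, not a figure from this page.
PATCH = 16


def patch_grid(image_side: int, patch: int = PATCH) -> int:
    """Number of patches along one side of a square image."""
    if image_side % patch != 0:
        raise ValueError("image side must be a multiple of the patch size")
    return image_side // patch


siglip_grid = patch_grid(384)     # semantic branch, 384x384 input
sam_grid = patch_grid(1024)       # detail branch, 1024x1024 input

siglip_tokens = siglip_grid ** 2  # tokens from the semantic branch
sam_positions = sam_grid ** 2     # spatial positions in the detail branch

print(siglip_grid, siglip_tokens)  # 24 576
print(sam_grid, sam_positions)     # 64 4096
```

Under that assumption the SigLIP branch yields a 24x24 grid, which matches the 576 visual tokens. The SAM branch yields a 64x64 grid, so its feature map has to be resized or pooled before the two streams can be combined token for token.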

DeepSeek-VL is distributed as an open-weight download: the four checkpoints (1.3b-base, 1.3b-chat, 7b-base, 7b-chat) are published on Hugging Face under the DeepSeek Model License, which permits commercial use, with the inference code released under MIT. There is no first-party per-token DeepSeek API for the VL line — it is meant to be self-hosted or run through third-party hosts such as Replicate. In December 2024 DeepSeek released the second-generation DeepSeek-VL2, which replaced the dense backbone with a Mixture-of-Experts architecture and a dynamic tiling encoder, superseding this original line.

Released: 2024-03
License: DeepSeek Model License (commercial use permitted) · code MIT
Weights: Open weights
Parameters: Two dense variants, DeepSeek-VL-1.3B (built on DeepSeek-LLM-1B, ~2B params with the vision stack) and DeepSeek-VL-7B (built on DeepSeek-LLM-7B). Each ships as a Base and a Chat (instruction-tuned) model.
Context: 4K (4096-token training sequence length)
Max output: Undisclosed
Architecture: Dense vision-language model with a hybrid two-encoder vision stack: a SigLIP-L encoder extracts coarse semantics at 384x384 and a SAM-B encoder captures fine detail at 1024x1024 (producing 64x64x256 feature maps). The two streams are fused and projected by a VL adaptor into 576 visual tokens of 2048 dimensions. The language side is a standard dense DeepSeek-LLM transformer (1B or 7B base). Training used a three-stage recipe (adaptor warm-up, joint VL pretraining, supervised fine-tuning) at a 4096-token sequence length, with LLM training integrated from the start to balance the competition between the vision and language modalities.
Knowledge cutoff: Undisclosed (language backbone DeepSeek-LLM, pretrained 2023)
Modalities: Text, Vision
Status: Generally available (open weights); superseded by the MoE-based DeepSeek-VL2 in December 2024
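The context and visual-token figures above imply a rough budget for text. The following is a minimal sketch, assuming every image costs a flat 576 tokens and ignoring special or chat-template tokens; `text_budget` is a hypothetical helper, not part of the DeepSeek-VL code.

```python
CONTEXT_LENGTH = 4096    # training sequence length quoted in the spec list
TOKENS_PER_IMAGE = 576   # visual tokens per image quoted in the spec list


def text_budget(n_images: int,
                context: int = CONTEXT_LENGTH,
                per_image: int = TOKENS_PER_IMAGE) -> int:
    """Tokens left for prompt and response after n_images are embedded.

    Assumes a flat per-image cost and ignores special/template tokens.
    """
    if n_images < 0:
        raise ValueError("n_images must be non-negative")
    used = n_images * per_image
    if used > context:
        raise ValueError(f"{n_images} images need {used} tokens, over {context}")
    return context - used


for n in (1, 2, 7):
    print(n, text_budget(n))
# 1 3520
# 2 2944
# 7 64
```

On these assumptions a single image leaves about 3,500 tokens for text, and an eighth image would no longer fit inside a 4096-token window.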

Benchmarks

  1. MMBench (7B): 73.2%
  2. MMBench (1.3B): 64.6%
  3. SEEDBench (7B): 70.4%
  4. SEEDBench (1.3B): 66.7%
  5. MMMU (7B): 36.6%
  6. MMMU (1.3B): 32.2%
  7. MM-Vet (7B): 41.5%
  8. MathVista (7B): 36.1%
  9. POPE (7B): 88.1%
  10. OCRBench (7B): 456
  11. MMLU (7B Chat, language): 52.4%
  12. HellaSwag (7B Chat, language): 68.4%

Percentage scores are on a 0–100 scale and higher is better. OCRBench is reported as a raw score rather than a percentage.
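For the three benchmarks where both sizes are reported, the gap between the 7B and 1.3B models can be read off directly. The snippet below is only a convenience sketch over the numbers listed above; `size_gap` is an illustrative helper.

```python
# Scores copied from the benchmark list; only benchmarks that report
# both model sizes are included.
scores = {
    "MMBench": {"7B": 73.2, "1.3B": 64.6},
    "SEEDBench": {"7B": 70.4, "1.3B": 66.7},
    "MMMU": {"7B": 36.6, "1.3B": 32.2},
}


def size_gap(table: dict) -> dict:
    """Return {benchmark: 7B score minus 1.3B score}, rounded to 0.1."""
    return {name: round(s["7B"] - s["1.3B"], 1) for name, s in table.items()}


gaps = size_gap(scores)
for name, gap in gaps.items():
    print(f"{name}: 7B leads by {gap} points")
# MMBench: 7B leads by 8.6 points
# SEEDBench: 7B leads by 3.7 points
# MMMU: 7B leads by 4.4 points
```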

Strengths

  • Open weights under a commercial-use license, with MIT-licensed inference code — fully self-hostable
  • Hybrid SigLIP-L + SAM-B encoder reads high-resolution images (up to 1024x1024) for documents, OCR and charts without aggressive downscaling
  • Preserves language ability: DeepSeek-VL-7B matches text-only DeepSeek-7B on language benchmarks (MMLU 52.4, HellaSwag 68.4)
  • Two sizes — a 1.3B variant light enough for modest GPUs and a stronger 7B variant
  • Trained on deliberately real-world data (web screenshots, PDFs, OCR, charts, knowledge content) rather than only academic VQA sets
  • Strong hallucination resistance for its size (POPE 88.1 for the 7B model)
  • Both Base and instruction-tuned Chat checkpoints released for each size
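To put the "modest GPUs" claim in rough numbers, weight memory can be estimated as parameter count times bytes per parameter. This is a back-of-the-envelope sketch: it covers weights only (no activations or KV cache), uses the ~2B parameter figure quoted on this page for DeepSeek-VL-1.3B with its vision stack, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# ~2B parameters: the figure quoted for DeepSeek-VL-1.3B including vision stack.
small_16bit = weight_memory_gib(2e9)      # 16-bit weights
small_32bit = weight_memory_gib(2e9, 4)   # 32-bit weights

print(f"{small_16bit:.1f} GiB at 16-bit, {small_32bit:.1f} GiB at 32-bit")
# 3.7 GiB at 16-bit, 7.5 GiB at 32-bit
```

Actual usage will be higher once activations, the KV cache and framework overhead are included.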

Best for

  • General visual question answering over photos and screenshots
  • Reading and answering questions about documents, PDFs and forms
  • OCR and text extraction from images and dense documents
  • Chart, plot and infographic understanding for analytics workflows
  • Self-hosted multimodal assistants where open weights and commercial licensing are required
  • On-device or modest-GPU multimodal deployment using the 1.3B variant
  • Research baseline for vision-language pretraining and instruction tuning

How to access

Provider | Model ID
Hugging Face (download weights) | deepseek-ai/deepseek-vl-7b-chat
Replicate | deepseek-ai/deepseek-vl-7b-base
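As a sketch of the self-hosting route, the weights can be fetched with the `huggingface_hub` package's `snapshot_download` function. The helper names below (`repo_id`, `download`) are illustrative, and the 1.3B repository IDs are assumed to follow the same `deepseek-ai/deepseek-vl-<size>-<variant>` pattern as the 7B IDs listed above.

```python
from typing import Optional

SIZES = ("1.3b", "7b")
VARIANTS = ("base", "chat")


def repo_id(size: str, variant: str) -> str:
    """Build a Hugging Face repo ID, assuming the naming pattern shown above."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}")
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}")
    return f"deepseek-ai/deepseek-vl-{size}-{variant}"


def download(size: str, variant: str, local_dir: Optional[str] = None) -> str:
    """Download one checkpoint and return the local path.

    Requires the third-party package: pip install huggingface_hub
    """
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=repo_id(size, variant), local_dir=local_dir)


all_repos = [repo_id(s, v) for s in SIZES for v in VARIANTS]
print(all_repos)
```

Calling `download("7b", "chat")` would pull the full checkpoint, which is a multi-gigabyte transfer. Running the model afterwards needs DeepSeek's own inference code, which is not shown here.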

DeepSeek VL — every version

The full lineage of the DeepSeek VL line, newest first.

Version | Released | Context | License
DeepSeek-VL2 (current) | 2024-12-13 | not listed | Open weights
DeepSeek-VL | 2024-03 | not listed | Open weights

FAQ

What is DeepSeek-VL?

DeepSeek-VL is DeepSeek's first open-weight vision-language model series, released in March 2024. It pairs a dense DeepSeek-LLM language backbone with a hybrid vision encoder (SigLIP-L plus SAM-B) so it can understand images, documents, OCR text and charts. It ships in two sizes — 1.3B and 7B — each with a Base and an instruction-tuned Chat checkpoint.

What sizes and variants does DeepSeek-VL come in?

Four checkpoints across two sizes: deepseek-vl-1.3b-base, deepseek-vl-1.3b-chat, deepseek-vl-7b-base and deepseek-vl-7b-chat. The 1.3B models are built on DeepSeek-LLM-1B and the 7B models on DeepSeek-LLM-7B. Base models are pretrained foundations; Chat models are instruction-tuned for conversational multimodal use.

Is DeepSeek-VL open source and free for commercial use?

The weights are published on Hugging Face under the DeepSeek Model License, which permits commercial use, and the inference code is released under the MIT license. You can download and self-host all four checkpoints, or run them through third-party hosts such as Replicate. There is no first-party per-token DeepSeek API for the VL line.

How does DeepSeek-VL compare to DeepSeek-VL2?

DeepSeek-VL (March 2024) is the original dense series with a hybrid SigLIP-L + SAM-B encoder. DeepSeek-VL2 (December 2024) replaced the dense backbone with a sparse Mixture-of-Experts (DeepSeekMoE with Multi-head Latent Attention) and added a dynamic tiling vision encoder, delivering substantially stronger OCR, document and chart accuracy while activating only a fraction of its weights per token. VL2 supersedes this original line.