DeepSeek-VL

Name: DeepSeek-VL
Author: DeepSeek

DeepSeek's first open-weight vision-language models: dense 1.3B and 7B models with a hybrid SigLIP + SAM encoder for real-world image, document and chart understanding.

Overview

DeepSeek-VL is DeepSeek's first-generation, open-weight vision-language model series, released on 11 March 2024 alongside the technical report 'DeepSeek-VL: Towards Real-World Vision-Language Understanding' (arXiv:2403.05525). It comes in two dense sizes — DeepSeek-VL-1.3B and DeepSeek-VL-7B — each available as a Base model and an instruction-tuned Chat model. The design goal was practical, real-world multimodal understanding rather than benchmark chasing: the training data deliberately spans web screenshots, PDFs, OCR, charts and knowledge content, and the instruction-tuning set was built from a taxonomy of real user scenarios.

Architecturally, DeepSeek-VL pairs a dense DeepSeek-LLM language backbone (DeepSeek-LLM-1B for the 1.3B model, DeepSeek-LLM-7B for the 7B model) with a hybrid vision encoder. A SigLIP-L branch reads images at 384x384 for semantics while a SAM-B branch reads at 1024x1024 for fine detail; the two feature streams are fused into 576 visual tokens. This high-resolution path lets the model read documents, small text and dense charts without aggressive downscaling, at relatively low compute overhead. A key finding of the paper is that mixing language data into vision-language pretraining from the beginning preserves the model's text abilities: DeepSeek-VL-7B performs on par with the text-only DeepSeek-LLM-7B on language benchmarks.
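The token count can be checked with simple shape arithmetic. The sketch below is not DeepSeek's implementation. It tracks only tensor shapes, and it assumes a 16-pixel SigLIP patch size, a 1024-wide SigLIP output, and the adaptor steps described in the technical report (interpolating the 64x64 SAM-B map to 96x96, then two stride-2 convolutions ending at 1024 channels). The names `Features`, `siglip_branch`, `sam_branch` and `fuse` are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    tokens: int  # number of spatial positions
    dim: int     # channels per position


def siglip_branch(image_size: int = 384, patch_size: int = 16, dim: int = 1024) -> Features:
    # Assumption: 16-pixel patches, so a 384x384 image gives a 24x24 grid.
    side = image_size // patch_size
    return Features(tokens=side * side, dim=dim)


def sam_branch(interpolated_side: int = 96, num_stride2_convs: int = 2, out_dim: int = 1024) -> Features:
    # SAM-B produces a 64x64x256 map at 1024x1024 input.
    # Assumption: the adaptor interpolates it to 96x96, then each stride-2
    # convolution halves the side length; the final channel width is 1024.
    side = interpolated_side
    for _ in range(num_stride2_convs):
        side //= 2
    return Features(tokens=side * side, dim=out_dim)


def fuse(low_res: Features, high_res: Features) -> Features:
    # Channel-wise concatenation: token counts must match, dimensions add up.
    if low_res.tokens != high_res.tokens:
        raise ValueError("both branches must yield the same number of tokens")
    return Features(tokens=low_res.tokens, dim=low_res.dim + high_res.dim)


fused = fuse(siglip_branch(), sam_branch())
print(fused)  # Features(tokens=576, dim=2048)
```

Under these assumptions both branches land on a 24x24 grid, which is why concatenation yields 576 tokens of 2048 dimensions regardless of how much detail the 1024x1024 branch captured.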

DeepSeek-VL is distributed as an open-weight download: the four checkpoints (1.3b-base, 1.3b-chat, 7b-base, 7b-chat) are published on Hugging Face under the DeepSeek Model License, which permits commercial use, with the inference code released under MIT. There is no first-party per-token DeepSeek API for the VL line — it is meant to be self-hosted or run through third-party hosts such as Replicate. In December 2024 DeepSeek released the second-generation DeepSeek-VL2, which replaced the dense backbone with a Mixture-of-Experts architecture and a dynamic tiling encoder, superseding this original line.

Released	2024-03
License	DeepSeek Model License (commercial use permitted) · code MIT
Weights	Open weights
Parameters	Two dense variants — DeepSeek-VL-1.3B (built on DeepSeek-LLM-1B, ~2B params with the vision stack) and DeepSeek-VL-7B (built on DeepSeek-LLM-7B). Each ships as a Base and a Chat (instruction-tuned) model.
Context	4K (4096-token training sequence length)
Max output	Undisclosed
Architecture	Dense vision-language model with a hybrid two-encoder vision stack: a SigLIP-L encoder extracts coarse semantics at 384x384 and a SAM-B encoder captures fine detail at 1024x1024 (producing 64x64x256 feature maps). A VL adaptor fuses the two streams into 576 visual tokens of 2048 dimensions and projects them for the language model. The language side is a standard dense DeepSeek-LLM transformer (1B or 7B base). Training used a three-stage recipe (adaptor warm-up, joint VL pretraining, supervised fine-tuning) at a 4096-token sequence length, with language data mixed in from the start to manage the competition between vision and language abilities.
Knowledge cutoff	Undisclosed (language backbone DeepSeek-LLM, pretrained 2023)
Modalities	Text, Vision
Status	Generally available (open weights) — superseded by the MoE-based DeepSeek-VL2 in December 2024
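The context and visual-token figures above imply a simple budget for prompts that include images. The following is a rough sketch, assuming every image costs exactly 576 tokens and that the 4096-token training length is also the practical limit at inference time; neither assumption is confirmed here, and `text_token_budget` and `max_images` are hypothetical helper names.

```python
CONTEXT_TOKENS = 4096          # training sequence length from the spec table
VISUAL_TOKENS_PER_IMAGE = 576  # fused visual tokens per image


def text_token_budget(num_images: int,
                      context_tokens: int = CONTEXT_TOKENS,
                      tokens_per_image: int = VISUAL_TOKENS_PER_IMAGE) -> int:
    """Tokens left for text (prompt plus reply) after reserving image tokens."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    used = num_images * tokens_per_image
    if used > context_tokens:
        raise ValueError("images alone exceed the context window")
    return context_tokens - used


def max_images(context_tokens: int = CONTEXT_TOKENS,
               tokens_per_image: int = VISUAL_TOKENS_PER_IMAGE) -> int:
    """Largest number of images that fits, ignoring any text tokens."""
    return context_tokens // tokens_per_image


print(text_token_budget(1))  # 3520
print(max_images())          # 7
```

A single image leaves most of the window for text, while multi-image prompts consume it quickly: seven images would leave only 64 tokens.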

Benchmarks

[Figure: benchmark scores on a 0–100 scale; higher is better.]

Strengths

Open weights under a commercial-use license, with MIT-licensed inference code — fully self-hostable
Hybrid SigLIP-L + SAM-B encoder reads high-resolution images (up to 1024x1024) for documents, OCR and charts without aggressive downscaling
Preserves language ability: DeepSeek-VL-7B matches the text-only DeepSeek-LLM-7B on language benchmarks (MMLU 52.4, HellaSwag 68.4)
Two sizes — a 1.3B variant light enough for modest GPUs and a stronger 7B variant
Trained on deliberately real-world data (web screenshots, PDFs, OCR, charts, knowledge content) rather than only academic VQA sets
Strong hallucination resistance for its size (POPE 88.1 for the 7B model)
Both Base and instruction-tuned Chat checkpoints released for each size
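Whether a variant fits a given GPU can be sanity-checked with a weights-only estimate. This is a back-of-the-envelope sketch: it ignores activations, the KV cache and framework overhead, so real usage will be higher. The ~2B figure for the 1.3B variant comes from the spec table; the 7e9 figure for the 7B variant is an assumed lower bound, since the total including the vision stack is not stated here. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB.

    bytes_per_param: 2 for fp16/bf16, 4 for fp32.
    """
    if num_params <= 0 or bytes_per_param <= 0:
        raise ValueError("num_params and bytes_per_param must be positive")
    return num_params * bytes_per_param / 2**30


# ~2B parameters including the vision stack (from the spec table).
small = weight_memory_gib(2e9)
# Assumption: at least 7e9 parameters; the vision stack adds more.
large_lower_bound = weight_memory_gib(7e9)

print(f"1.3B variant, fp16 weights: ~{small:.1f} GiB")
print(f"7B variant, fp16 weights: at least ~{large_lower_bound:.1f} GiB")
```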

Best for

General visual question answering over photos and screenshots
Reading and answering questions about documents, PDFs and forms
OCR and text extraction from images and dense documents
Chart, plot and infographic understanding for analytics workflows
Self-hosted multimodal assistants where open weights and commercial licensing are required
On-device or modest-GPU multimodal deployment using the 1.3B variant
Research baseline for vision-language pretraining and instruction tuning

How to access

Provider	Model ID
Hugging Face (download weights)	`deepseek-ai/deepseek-vl-7b-chat`
Replicate	`deepseek-ai/deepseek-vl-7b-base`
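The four checkpoints follow one naming pattern, so their repository IDs can be built from a size and a variant. This is a minimal sketch that assumes all four live under the same `deepseek-ai` organization as the ID listed above; `checkpoint_repo_id` and `all_checkpoints` are hypothetical helpers, not part of any DeepSeek tooling.

```python
SIZES = ("1.3b", "7b")
VARIANTS = ("base", "chat")


def checkpoint_repo_id(size: str, variant: str, org: str = "deepseek-ai") -> str:
    """Build a repository ID such as 'deepseek-ai/deepseek-vl-7b-chat'."""
    size, variant = size.lower(), variant.lower()
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"{org}/deepseek-vl-{size}-{variant}"


def all_checkpoints() -> list[str]:
    """List every size and variant combination."""
    return [checkpoint_repo_id(s, v) for s in SIZES for v in VARIANTS]


for repo_id in all_checkpoints():
    print(repo_id)
```

The resulting string can be passed to a downloader such as `snapshot_download(repo_id=...)` from the `huggingface_hub` package to fetch the weights for self-hosting.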

DeepSeek VL — every version

The full lineage of the DeepSeek VL line, newest first.

Version	Released	Context	License
DeepSeek-VL2 (current)	2024-12-13	—	Open weights
DeepSeek-VL	2024-03	4K	Open weights

FAQ

What is DeepSeek-VL?

DeepSeek-VL is DeepSeek's first open-weight vision-language model series, released in March 2024. It pairs a dense DeepSeek-LLM language backbone with a hybrid vision encoder (SigLIP-L plus SAM-B) so it can understand images, documents, OCR text and charts. It ships in two sizes — 1.3B and 7B — each with a Base and an instruction-tuned Chat checkpoint.

What sizes and variants does DeepSeek-VL come in?

Four checkpoints across two sizes: deepseek-vl-1.3b-base, deepseek-vl-1.3b-chat, deepseek-vl-7b-base and deepseek-vl-7b-chat. The 1.3B models are built on DeepSeek-LLM-1B and the 7B models on DeepSeek-LLM-7B. Base models are pretrained foundations; Chat models are instruction-tuned for conversational multimodal use.

Is DeepSeek-VL open source and free for commercial use?

The weights are published on Hugging Face under the DeepSeek Model License, which permits commercial use, and the inference code is released under the MIT license. You can download and self-host all four checkpoints, or run them through third-party hosts such as Replicate. There is no first-party per-token DeepSeek API for the VL line.

How does DeepSeek-VL compare to DeepSeek-VL2?

DeepSeek-VL (March 2024) is the original dense series with a hybrid SigLIP-L + SAM-B encoder. DeepSeek-VL2 (December 2024) replaced the dense backbone with a sparse Mixture-of-Experts (DeepSeekMoE with Multi-head Latent Attention) and added a dynamic tiling vision encoder, delivering substantially stronger OCR, document and chart accuracy while activating only a fraction of its weights per token. VL2 supersedes this original line.
