Overview
DeepSeek-V2 is an open-weight Mixture-of-Experts large language model released by Chinese AI lab DeepSeek in May 2024. It has 236 billion total parameters but activates only 21 billion per token, which is what lets a model this large run cheaply. It supports a 128K-token context window and was pretrained on 8.1 trillion tokens of text and code.
Its two headline ideas are DeepSeekMoE and Multi-head Latent Attention (MLA). DeepSeekMoE builds each MoE feed-forward layer from 2 shared experts, which are always active, plus 160 routed experts, of which only 6 are selected per token, so most of the network sits idle on any given step. MLA compresses the key-value cache that normally dominates inference memory; DeepSeek reports that it cuts that cache by 93.3% and raises maximum generation throughput 5.76x compared with the older dense DeepSeek 67B. Together these techniques are why DeepSeek-V2 could be served so cheaply.
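The expert selection described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: the expert counts come from the text, while the random router scores, the `route_token` helper, and the choice to use plain softmax gates without renormalization are assumptions made for the example.

```python
import math
import random

NUM_SHARED = 2    # always-active shared experts (from the text)
NUM_ROUTED = 160  # routed experts per MoE layer (from the text)
TOP_K = 6         # routed experts selected per token (from the text)


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, gate) pairs for the top_k routed experts.

    Gates are softmax probabilities over all routed experts. Leaving them
    unnormalized after selection is an assumption of this sketch.
    """
    peak = max(router_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return [(i, probs[i]) for i in chosen]


rng = random.Random(0)
# Stand-in router scores; a real router computes these from the token's hidden state.
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]
selected = route_token(logits)

active = NUM_SHARED + len(selected)
total_experts = NUM_SHARED + NUM_ROUTED
print(f"routed experts chosen: {[i for i, _ in selected]}")
print(f"experts active for this token: {active} of {total_experts}")
```

Only the selected routed experts and the shared experts would run their feed-forward computation for this token; the remaining routed experts are skipped entirely.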
DeepSeek-V2 is best remembered for triggering a price war: its launch API rates were so low (the Financial Times reported roughly 2 RMB per million output tokens) that other Chinese labs quickly cut their own prices. The model is now discontinued — DeepSeek replaced it with DeepSeek-V2.5 in September 2024 and the much larger DeepSeek-V3 in December 2024 — but its open weights remain on Hugging Face and its MLA + MoE recipe carried directly into those successors.
| Field | Value |
|---|---|
| Released | 2024-05 |
| License | DeepSeek Model License (source-available; commercial use permitted). Repository code is MIT-licensed. |
| Weights | Open weights |
| Parameters | 236B total, 21B activated per token (Mixture-of-Experts) |
| Context | 128K tokens |
| Architecture | Mixture-of-Experts (MoE) decoder-only Transformer. Feed-forward layers use DeepSeekMoE (2 shared experts + 160 routed experts, with 6 routed experts activated per token); attention uses Multi-head Latent Attention (MLA), which compresses the key-value cache via low-rank joint compression. Pretrained on 8.1 trillion tokens. DeepSeek reports a 93.3% KV-cache reduction and 5.76x higher maximum generation throughput versus the earlier dense DeepSeek 67B. |
| Knowledge cutoff | Not officially published by DeepSeek |
| Modalities | text |
| Status | Discontinued — superseded by DeepSeek-V2.5 (September 2024) and DeepSeek-V3 (December 2024). Open weights remain available on Hugging Face; the original hosted API endpoint has long since moved to newer models. |
Benchmarks
- MMLU (Base): 78.5%
- BBH (Base): 78.9%
- C-Eval (Base): 81.7%
- CMMLU (Base): 84.0%
- GSM8K (Chat): 92.2%
- HumanEval (Chat): 81.1%
- MATH (Chat): 53.9%
- MMLU (Chat): 77.8%
Scores are percentages on a 0–100 scale; higher is better.
Strengths
- Extremely cheap to serve for its size — only 21B of 236B parameters activate per token
- Multi-head Latent Attention cuts KV-cache memory by 93.3% relative to the dense DeepSeek 67B, enabling long contexts at low cost
- Strong coding and math scores for an open model of its era (HumanEval 81.1, GSM8K 92.2 on the Chat model)
- 128K-token context window; DeepSeek reports strong Needle-in-a-Haystack retrieval results across context lengths up to 128K
- Open weights under a commercially-permissive license, with an MIT-licensed code repository
- Strong bilingual (Chinese + English) performance — C-Eval 81.7, CMMLU 84.0 on the base model
Best for
- Low-cost, high-throughput text generation and chat at scale
- Coding assistance and code generation
- Math and reasoning tasks
- Long-document question answering and summarization within a 128K window
- Chinese and English bilingual applications
- Self-hosting on your own GPUs (e.g. via vLLM) when you need open weights and data control
- A historical reference / baseline for studying MoE and MLA architectures
How to access
| Provider | Model ID |
|---|---|
| DeepSeek | deepseek-chat (historical; endpoint has since moved to newer models) |
| Hugging Face | deepseek-ai/DeepSeek-V2 |
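For self-hosting, the Hugging Face model ID in the table can be loaded with the `transformers` library. The following is a hedged sketch rather than an official recipe: the helper names `load_deepseek_v2` and `generate` are hypothetical, and the `trust_remote_code`, `torch_dtype`, and `device_map` settings are assumptions that may need adjusting for your library versions and hardware.

```python
MODEL_ID = "deepseek-ai/DeepSeek-V2"  # Hugging Face model ID from the table


def load_deepseek_v2(model_id=MODEL_ID):
    """Load the tokenizer and model with Hugging Face transformers.

    Imports are deferred so that defining this function does not require
    the libraries, or the very large checkpoint, to be present.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,      # assumption: the repo ships custom model code
        torch_dtype=torch.bfloat16,  # assumption: BF16 suits your hardware
        device_map="auto",           # spreads layers across GPUs; needs `accelerate`
    )
    return tokenizer, model


def generate(prompt, max_new_tokens=64):
    """Generate a completion for one prompt (hypothetical convenience helper)."""
    tokenizer, model = load_deepseek_v2()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Hello")` would download the full 236B-parameter checkpoint, so this is only practical on a multi-GPU server with enough memory to hold the weights.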
DeepSeek V2 and V3: every version
The full lineage of the DeepSeek V2 and V3 lines, newest first.
| Version | Released | Context | License |
|---|---|---|---|
| DeepSeek-V3.2 (current) | 2025-12-01 | — | Open weights |
| DeepSeek-V3.2-Speciale | 2025-12-01 | — | Open weights |
| DeepSeek-V3.2-Exp | 2025-09-29 | — | Open weights |
| DeepSeek-V3.1-Terminus | 2025-09-22 | — | Open weights |
| DeepSeek-V3.1 | 2025-08-21 | — | Open weights |
| DeepSeek-V3-0324 | 2025-03-24 | — | Open weights |
| DeepSeek-V3 | 2024-12-26 | — | Open weights |
| DeepSeek-V2.5 | 2024-09-05 | — | Open weights |
| DeepSeek-V2 | 2024-05 | — | Open weights |
FAQ
What is DeepSeek-V2?
DeepSeek-V2 is an open-weight Mixture-of-Experts large language model released by DeepSeek in May 2024. It has 236 billion total parameters but activates only 21 billion per token, supports a 128K-token context window, and was trained on 8.1 trillion tokens.
Is DeepSeek-V2 still available?
Its open weights are still downloadable on Hugging Face, but the model is discontinued. DeepSeek replaced it with DeepSeek-V2.5 in September 2024 and DeepSeek-V3 in December 2024, and the hosted API now serves newer models.
What made DeepSeek-V2 important?
Two things. Technically, it introduced Multi-head Latent Attention (MLA) and scaled up the DeepSeekMoE design, cutting KV-cache memory by 93.3% and raising maximum generation throughput 5.76x versus the dense DeepSeek 67B. Commercially, its very low launch price set off a price war among Chinese AI labs.
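To see why caching a compressed latent saves memory, the sketch below compares per-token, per-layer cache sizes for standard multi-head attention and for a low-rank latent cache. All dimensions are hypothetical values chosen for illustration, not DeepSeek-V2's configuration, and the result is not meant to reproduce the 93.3% figure, which DeepSeek measured against DeepSeek 67B.

```python
def kv_cache_elems_mha(num_heads, head_dim):
    """Elements cached per token per layer when full keys and values are stored."""
    return 2 * num_heads * head_dim  # one key and one value vector per head


def kv_cache_elems_latent(latent_dim, rope_dim=0):
    """Elements cached per token per layer when a single compressed latent is stored.

    rope_dim models a small separately cached positional key; treat it as an
    optional extra in this sketch.
    """
    return latent_dim + rope_dim


# Hypothetical dimensions, for illustration only.
NUM_HEADS = 16
HEAD_DIM = 64
LATENT_DIM = 256
ROPE_DIM = 32

full = kv_cache_elems_mha(NUM_HEADS, HEAD_DIM)
compressed = kv_cache_elems_latent(LATENT_DIM, ROPE_DIM)
saving = 1 - compressed / full
print(f"full K/V: {full} elements, latent: {compressed} elements, saving: {saving:.1%}")
```

The saving grows with the number of heads, because the full cache scales with `num_heads * head_dim` while the latent cache stays a fixed width.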
How big is DeepSeek-V2 and how much runs at once?
It has 236B total parameters but only 21B activate per token because it is a Mixture-of-Experts model. Each MoE layer has 2 shared experts, which always run, plus 160 routed experts, of which just 6 are selected per token. This keeps inference cheap despite the large total size.
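The arithmetic behind these figures is simple enough to check directly. The sketch below uses the parameter counts quoted above; the 2-bytes-per-parameter figure assumes BF16 or FP16 weights and ignores activations and the KV cache.

```python
TOTAL_PARAMS_B = 236   # total parameters, in billions (from the text)
ACTIVE_PARAMS_B = 21   # parameters activated per token, in billions (from the text)
BYTES_PER_PARAM = 2    # assumption: 16-bit weights (BF16 or FP16)

# Share of the model that does work for any single token.
active_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Memory needed just to hold the weights: billions of params * bytes each = GB.
weight_gb_16bit = TOTAL_PARAMS_B * BYTES_PER_PARAM

print(f"active share per token: {active_share:.1%}")
print(f"weights at 16-bit precision: about {weight_gb_16bit} GB")
```

The two numbers pull in different directions: compute per token tracks the 21B active parameters, but all 236B parameters must still be resident in memory, since any routed expert may be selected by the next token.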