Overview
DeepSeek-V4-Flash is the smaller, faster, and cheaper of the two models in DeepSeek's V4 series, released as a preview on 24 April 2026 alongside the larger DeepSeek-V4-Pro. It is a Mixture-of-Experts language model with 284 billion total parameters, of which 13 billion are activated per token, and it natively supports a one-million-token context window with up to 384K tokens of output.
Like the Pro tier, DeepSeek-V4-Flash uses a Hybrid Attention Architecture that combines Compressed Sparse Attention and Heavily Compressed Attention. DeepSeek reports that in the 1M-token setting it needs only about 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2, which is what makes such a long context affordable. The model was pre-trained on more than 32 trillion tokens, runs in mixed FP4/FP8 precision (FP4 for MoE experts), and exposes three reasoning-effort modes: Non-think, Think High, and Think Max. DeepSeek says Flash in its Max mode approaches the reasoning quality of V4-Pro when given a larger thinking budget.
All weights are published on Hugging Face under the MIT License, so DeepSeek-V4-Flash is free for commercial use, fine-tuning, and self-hosting via vLLM, SGLang, or quantized GGUF builds in Ollama and LM Studio. On the DeepSeek API the legacy deepseek-chat and deepseek-reasoner aliases now route to Flash's non-thinking and thinking modes, and DeepSeek positions it as a low-cost agentic and coding workhorse at $0.14 per million input tokens.
| Released | 2026-04-24 |
|---|---|
| License | MIT |
| Weights | Open weights |
| Parameters | 284B total · 13B active |
| Context | 1M |
| Max output | 384K |
| Architecture | Mixture-of-Experts (Hybrid Attention: CSA + HCA) |
| Modalities | Text |
| Status | Preview |
Benchmarks
- MMLU-Pro (Max)86.2%
- GPQA Diamond (Max)88.1%
- HMMT 2026 Feb (Max)94.8%
- LiveCodeBench (Max)91.6%
- SWE Verified (Max)79%
- SimpleQA-Verified (Max)34.1%
- MRCR 1M (Max)78.7%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.14 / 1M tokens |
|---|---|
| Cached input | $0.0028 / 1M tokens |
| Output | $0.28 / 1M tokens |
Strengths
- Open MIT-licensed weights, free for commercial use, fine-tuning, and local deployment via vLLM, SGLang, Ollama, or LM Studio
- Native one-million-token context with efficient hybrid attention (roughly 27% of V3.2's per-token FLOPs and 10% of its KV cache at 1M)
- Selectable reasoning effort (Non-think / Think High / Think Max) to trade latency for depth
- Very low API pricing — $0.14 per 1M input tokens and $0.28 per 1M output, about 12x cheaper than V4-Pro, with aggressive cache-hit discounts
- Strong coding and reasoning scores for its size, approaching V4-Pro in the Max mode (LiveCodeBench 91.6, SWE Verified 79.0)
Best for
- High-volume agentic and coding workflows where cost and latency matter more than absolute peak quality
- Long-document and large-codebase analysis that needs the full 1M-token window on a budget
- Self-hosted / on-prem deployment where open MIT weights and a smaller 13B-active footprint are required
- A drop-in replacement for the retiring deepseek-chat and deepseek-reasoner endpoints via its non-thinking and thinking modes
How to access
| Provider | Model ID |
|---|---|
| DeepSeek Platform ↗ | deepseek-v4-flash |
DeepSeek V4 — every version
The full lineage of the DeepSeek V4 line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| DeepSeek-V4-Procurrent | 2026-04-24 | 1M | MIT |
| DeepSeek-V4-Flash | 2026-04-24 | — | MIT |
FAQ
Is DeepSeek-V4-Flash open source?
Yes. The weights are published on Hugging Face under the MIT License, one of the most permissive open licenses, so you can download, run, fine-tune, and use DeepSeek-V4-Flash commercially, including self-hosted and on-prem. It can be served with vLLM or SGLang, or run from quantized GGUF builds in Ollama and LM Studio.
How is DeepSeek-V4-Flash different from DeepSeek-V4-Pro?
Both shipped on 24 April 2026 and share the same 1M-token context, hybrid attention design, and MIT license. Flash is the smaller, cheaper tier — 284B total / 13B active parameters versus Pro's 1.6T / 49B — so it responds faster and costs about 12x less ($0.14 vs $0.435 per 1M input tokens). DeepSeek says Flash in its Max reasoning mode approaches Pro's quality given a larger thinking budget.
What are the reasoning modes in DeepSeek-V4-Flash?
It offers three reasoning-effort settings: Non-think for fast direct answers, Think High for deliberate step-by-step analysis, and Think Max (V4-Flash-Max) for maximum reasoning depth. The retiring deepseek-chat and deepseek-reasoner API aliases now route to Flash's non-thinking and thinking modes.
How much does the DeepSeek-V4-Flash API cost?
Per DeepSeek's official pricing page, DeepSeek-V4-Flash costs $0.14 per 1M input tokens on a cache miss and $0.28 per 1M output tokens. Cache hits drop the input price to about $0.0028 per 1M tokens, making repeated-context workloads very cheap.