AI/TLDR

DeepSeek-V4-Flash

DeepSeek's lightweight V4 tier: a 284B / 13B-active open-weight MoE with a 1M-token context, near-Pro reasoning, and prices as low as $0.14 per million input tokens.

Overview

DeepSeek-V4-Flash is the smaller, faster, and cheaper of the two models in DeepSeek's V4 series, released as a preview on 24 April 2026 alongside the larger DeepSeek-V4-Pro. It is a Mixture-of-Experts language model with 284 billion total parameters, of which 13 billion are activated per token, and it natively supports a one-million-token context window with up to 384K tokens of output.

Like the Pro tier, DeepSeek-V4-Flash uses a Hybrid Attention Architecture that combines Compressed Sparse Attention and Heavily Compressed Attention. DeepSeek reports that in the 1M-token setting it needs only about 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2, which is what makes such a long context affordable. The model was pre-trained on more than 32 trillion tokens, runs in mixed FP4/FP8 precision (FP4 for MoE experts), and exposes three reasoning-effort modes: Non-think, Think High, and Think Max. DeepSeek says Flash in its Max mode approaches the reasoning quality of V4-Pro when given a larger thinking budget.

All weights are published on Hugging Face under the MIT License, so DeepSeek-V4-Flash is free for commercial use, fine-tuning, and self-hosting via vLLM, SGLang, or quantized GGUF builds in Ollama and LM Studio. On the DeepSeek API the legacy deepseek-chat and deepseek-reasoner aliases now route to Flash's non-thinking and thinking modes, and DeepSeek positions it as a low-cost agentic and coding workhorse at $0.14 per million input tokens.

Released2026-04-24
LicenseMIT
WeightsOpen weights
Parameters284B total · 13B active
Context1M
Max output384K
ArchitectureMixture-of-Experts (Hybrid Attention: CSA + HCA)
ModalitiesText
StatusPreview

Benchmarks

  1. MMLU-Pro (Max)86.2%
  2. GPQA Diamond (Max)88.1%
  3. HMMT 2026 Feb (Max)94.8%
  4. LiveCodeBench (Max)91.6%
  5. SWE Verified (Max)79%
  6. SimpleQA-Verified (Max)34.1%
  7. MRCR 1M (Max)78.7%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.14 / 1M tokens
Cached input$0.0028 / 1M tokens
Output$0.28 / 1M tokens

Pricing source ↗

Strengths

  • Open MIT-licensed weights, free for commercial use, fine-tuning, and local deployment via vLLM, SGLang, Ollama, or LM Studio
  • Native one-million-token context with efficient hybrid attention (roughly 27% of V3.2's per-token FLOPs and 10% of its KV cache at 1M)
  • Selectable reasoning effort (Non-think / Think High / Think Max) to trade latency for depth
  • Very low API pricing — $0.14 per 1M input tokens and $0.28 per 1M output, about 12x cheaper than V4-Pro, with aggressive cache-hit discounts
  • Strong coding and reasoning scores for its size, approaching V4-Pro in the Max mode (LiveCodeBench 91.6, SWE Verified 79.0)

Best for

  • High-volume agentic and coding workflows where cost and latency matter more than absolute peak quality
  • Long-document and large-codebase analysis that needs the full 1M-token window on a budget
  • Self-hosted / on-prem deployment where open MIT weights and a smaller 13B-active footprint are required
  • A drop-in replacement for the retiring deepseek-chat and deepseek-reasoner endpoints via its non-thinking and thinking modes

How to access

ProviderModel ID
DeepSeek Platform ↗deepseek-v4-flash

DeepSeek V4 — every version

The full lineage of the DeepSeek V4 line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
DeepSeek-V4-Procurrent2026-04-241MMIT
DeepSeek-V4-Flash2026-04-24MIT

FAQ

Is DeepSeek-V4-Flash open source?

Yes. The weights are published on Hugging Face under the MIT License, one of the most permissive open licenses, so you can download, run, fine-tune, and use DeepSeek-V4-Flash commercially, including self-hosted and on-prem. It can be served with vLLM or SGLang, or run from quantized GGUF builds in Ollama and LM Studio.

How is DeepSeek-V4-Flash different from DeepSeek-V4-Pro?

Both shipped on 24 April 2026 and share the same 1M-token context, hybrid attention design, and MIT license. Flash is the smaller, cheaper tier — 284B total / 13B active parameters versus Pro's 1.6T / 49B — so it responds faster and costs about 12x less ($0.14 vs $0.435 per 1M input tokens). DeepSeek says Flash in its Max reasoning mode approaches Pro's quality given a larger thinking budget.

What are the reasoning modes in DeepSeek-V4-Flash?

It offers three reasoning-effort settings: Non-think for fast direct answers, Think High for deliberate step-by-step analysis, and Think Max (V4-Flash-Max) for maximum reasoning depth. The retiring deepseek-chat and deepseek-reasoner API aliases now route to Flash's non-thinking and thinking modes.

How much does the DeepSeek-V4-Flash API cost?

Per DeepSeek's official pricing page, DeepSeek-V4-Flash costs $0.14 per 1M input tokens on a cache miss and $0.28 per 1M output tokens. Cache hits drop the input price to about $0.0028 per 1M tokens, making repeated-context workloads very cheap.