AI/TLDR

vLLM Project · 2026-04-27 · major

vLLM v0.20.0 — DeepSeek V4, FlashAttention 4 Default, TurboQuant 2-bit KV Cache

752 commits from 320 contributors. Ships DeepSeek V4 support, FlashAttention 4 as default MLA prefill backend, TurboQuant 2-bit KV cache for 4× capacity, and CUDA 13 + PyTorch 2.11 as defaults.

vLLM's biggest release of 2026: DeepSeek V4, FA4 as default, and 4× KV cache via 2-bit compression.

Key specs

GitHub stars: 79,196
Contributors: 320
Commits in release: 752

What is it?

vLLM is the de facto open-source LLM inference and serving engine, with about 79k GitHub stars. Version 0.20.0 landed on April 27, 2026, with 752 commits from 320 contributors, making it one of the project's largest releases.

How does it work?

The release ships DeepSeek V4 with DSML token-leakage fixes, enables FlashAttention 4 as the default MLA prefill backend with head-dim 512 support, and introduces TurboQuant, a new 2-bit KV cache compression backend that delivers 4× cache capacity. The default CUDA wheel moves to CUDA 13.0 with PyTorch 2.11; Python 3.14 is now supported, and Transformers v5 compatibility is established.
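To make the capacity claim concrete, here is a rough back-of-the-envelope sketch of how bits per cached value translate into tokens that fit in a fixed memory budget. It assumes a plain key/value layout (one key and one value vector per layer and KV head) and ignores any scale or metadata overhead a real quantizer adds. The model shape and the 8-bit baseline are hypothetical; the release summary does not say which baseline the 4× figure is measured against, and this is not vLLM's own accounting.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bits_per_value):
    """Naive KV cache footprint for one token, in bytes.

    Counts one key vector and one value vector per layer and KV head,
    with no allowance for quantization scales or other metadata.
    """
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * bits_per_value / 8


def tokens_that_fit(budget_bytes, num_layers, num_kv_heads, head_dim, bits_per_value):
    """How many tokens of KV cache fit in a given memory budget."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bits_per_value
    )
    return int(budget_bytes // per_token)


if __name__ == "__main__":
    # Hypothetical model shape and a 1 GiB cache budget, for illustration only.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    budget = 2**30
    baseline = tokens_that_fit(budget, bits_per_value=8, **shape)
    compressed = tokens_that_fit(budget, bits_per_value=2, **shape)
    print(f"8-bit cache: {baseline} tokens")
    print(f"2-bit cache: {compressed} tokens")
    print(f"capacity ratio: {compressed / baseline:.1f}x")
```

Under this simplified accounting, going from 8 bits to 2 bits per value quadruples the number of cached tokens; against a 16-bit baseline the naive ratio would be larger, so the real-world figure depends on the baseline and on per-block overhead.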

Why does it matter?

Making FlashAttention 4 the default MLA prefill backend targets faster prefill, while TurboQuant's 4× KV cache capacity lets a single GPU hold more concurrent sequences or longer contexts. DeepSeek V4 support means the latest frontier open-weight reasoning model is production-ready in vLLM the same week it launched.

Who is it for?

ML engineers and infrastructure teams running LLM APIs at scale, especially those serving MoE models or working with quantized KV caches.

Try it

pip install vllm==0.20.0
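After installing, it can be useful to confirm which version actually ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `is_expected_version` are made up for this example and are not part of vLLM.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_expected_version(package, expected):
    """True only if the package is installed at exactly the expected version."""
    return installed_version(package) == expected


if __name__ == "__main__":
    found = installed_version("vllm")
    if found is None:
        print("vllm is not installed in this environment")
    elif is_expected_version("vllm", "0.20.0"):
        print("vllm 0.20.0 is installed")
    else:
        print(f"vllm is installed, but at version {found}")
```

The check reads package metadata rather than importing vllm, so it runs quickly and does not require a GPU.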

Tags

  • inference
  • llm-serving
  • open-source
  • cuda
  • deepseek
  • quantization