AI/TLDR

What Is vLLM?

You will understand how vLLM's PagedAttention and continuous batching squeeze dramatically more throughput from a single GPU than naive serving.

INTERMEDIATE · 12 MIN READ · UPDATED 2026-06-12

In plain English

vLLM is an open-source inference server built at UC Berkeley that became the de facto standard for high-throughput LLM serving. If you run an open model and need to serve real traffic, not just test it yourself, vLLM is almost certainly part of the answer. It achieves up to 24× the throughput of a naive HuggingFace Transformers setup by rethinking how a GPU's memory is managed and how requests are scheduled through the model.

The core problem vLLM solves is this: every request needs its own KV cache, a growing scratchpad that stores the attention keys and values for every token processed so far. Without a smart memory manager, 60–80% of the memory set aside for KV caches is reserved but empty, limiting how many users you can serve at once. vLLM introduced PagedAttention, an attention algorithm that manages KV cache memory like an operating system manages RAM, slashing that waste to under 4%.

Think of PagedAttention the way a hotel manages rooms. A traditional system reserves an entire floor for each guest — just in case they need it — even when they only occupy one room. PagedAttention rents out individual rooms on demand and reassigns them the moment a guest checks out. Same building, many more guests served.

Why it matters

vLLM matters because GPU time is expensive and LLM generation is slow. A server-grade H100 GPU can cost several dollars per hour in the cloud. The difference between a well-tuned vLLM deployment and a naive one can be 10× in total tokens served per dollar — that gap translates directly into whether a product is economically viable.

  • Throughput: total tokens served per second across all users. More throughput = lower cost per answer. vLLM's batching and paging are designed to push this number up.
  • Latency: how long one user waits. Two sub-metrics matter: time to first token (TTFT), the wait before any text appears, and inter-token latency (ITL), how fast tokens stream after that. Larger batches improve throughput but tend to increase both.
  • Memory efficiency: the fraction of GPU VRAM actually used for live token state versus wasted on empty reservations. Higher efficiency = more concurrent users per card.
  • OpenAI-compatibility: because vLLM implements the core OpenAI API endpoints, you can use the standard Python openai SDK, LangChain, LlamaIndex, or most other OpenAI-compatible tools by changing only the base URL.
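
The two latency metrics above are easy to measure from the client side. A minimal sketch, assuming you record when the request was sent and when each streamed token arrived; the function name latency_metrics and the sample timestamps are illustrative and not part of vLLM:

```python
def latency_metrics(sent_at: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT and mean inter-token latency from timestamps in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay between sending the request and seeing the first token.
    ttft = token_times[0] - sent_at
    # ITL: average gap between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "itl": itl}


# Hypothetical arrival times for a four-token response.
metrics = latency_metrics(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25])
print(metrics)
```

With a streaming client you would append a timestamp for each chunk as it arrives and pass the list in once the stream ends.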

By mid-2025 vLLM had grown to nearly 50 000 GitHub stars and had made its re-architected V1 engine the default. Major vendors including Google Cloud and Red Hat ship it as a supported serving backend. It is among the most widely deployed open-source LLM serving engines in production.

How it works: PagedAttention and continuous batching

vLLM has two signature mechanisms. They solve different problems but work together to achieve its performance.

PagedAttention: virtual memory for the KV cache

Every transformer layer's attention mechanism computes keys and values for every token and must revisit them on subsequent steps. The collection of these results is the KV cache. It grows one entry per token per layer, so a 70-billion-parameter model answering a 2 000-token request accumulates anywhere from several hundred megabytes to several gigabytes of KV data, depending on how many key/value heads the model keeps, and you might have 100 such requests in flight simultaneously.
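
A back-of-the-envelope sketch of that growth. The formula counts one key and one value vector per layer per KV head per token; the configuration numbers are assumptions modeled on a Llama-style 70B model (80 layers, head dimension 128, 16-bit values), shown once with 64 KV heads (full multi-head attention) and once with 8 (grouped-query attention):

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one request."""
    # Factor of 2: every token stores both a key and a value.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3

# Assumed shapes for illustration; check your model's config for real values.
mha = kv_cache_bytes(2000, layers=80, kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(2000, layers=80, kv_heads=8, head_dim=128)

print(f"multi-head: {mha / GIB:.2f} GiB, grouped-query: {gqa / GIB:.2f} GiB")
```

Multiply either figure by the number of requests in flight to see why the KV cache, not compute, usually caps concurrency.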

Before PagedAttention, servers pre-allocated a contiguous chunk of memory for each request sized to the maximum allowed output length. If a request finished early, that memory sat idle until the whole slot was released. Fragmentation and padding meant the majority of VRAM was wasted at any given moment.

PagedAttention splits each request's KV cache into fixed-size pages (typically 16 tokens each). A block table — a per-request mapping, analogous to a CPU's page table — records which physical memory pages hold which logical token positions. The GPU's attention kernel is rewritten to consult this table and gather non-contiguous pages, assembling the right keys and values on the fly. The result: memory is handed out only when a new page is actually needed, and freed the instant a request finishes.
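
The bookkeeping side of this can be sketched in a few lines. This is a toy model of the idea, not vLLM's implementation: the class names BlockAllocator and Sequence are made up for illustration, and the real system runs the lookup inside a GPU kernel.

```python
BLOCK_SIZE = 16  # tokens per page, the typical size mentioned above


class BlockAllocator:
    """Toy pool of physical KV pages (illustrative, not a vLLM class)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


class Sequence:
    """One request: a block table mapping logical pages to physical pages."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical page is claimed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, position: int) -> tuple[int, int]:
        # Translate a logical token position into (physical page, offset).
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def free(self) -> None:
        # Return every page to the pool as soon as the request finishes.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


pool = BlockAllocator(num_blocks=8)
seq = Sequence(pool)
for _ in range(20):
    seq.append_token()

print(seq.block_table, seq.physical_slot(17))
```

Twenty tokens occupy two pages rather than a slot sized for the maximum output length, and the pages need not be adjacent in memory.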

A bonus feature follows naturally: copy-on-write sharing. Two requests with identical prefixes (a shared system prompt, for example) can point their block tables at the same physical pages for those early tokens. Only when a request diverges does vLLM allocate a private copy. This is the mechanism behind prefix caching, discussed below.
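
Copy-on-write is usually implemented with reference counts. A minimal sketch under that assumption; SharedBlockPool and its methods are hypothetical names, and a real system would also copy the page's contents on divergence:

```python
class SharedBlockPool:
    """Toy ref-counted page pool (illustrative, not vLLM's API)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # A second request points at the same physical page.
        self.refcount[block] += 1
        return block

    def writable(self, block: int) -> int:
        """Return a page that is safe to write to."""
        if self.refcount[block] == 1:
            return block  # already private
        # Shared: drop our reference and take a private page instead.
        self.refcount[block] -= 1
        return self.allocate()


pool = SharedBlockPool(num_blocks=4)
parent_table = [pool.allocate(), pool.allocate()]    # two pages of shared prefix
child_table = [pool.share(b) for b in parent_table]  # fork uses no new memory

# The child diverges inside its last page, so only that page is duplicated.
child_table[-1] = pool.writable(child_table[-1])
print(parent_table, child_table)
```

The first page stays shared by both requests; only the page where the outputs differ costs extra memory.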

Continuous batching: keeping the GPU saturated

PagedAttention solves memory; continuous batching solves GPU utilization. In a static batch, the server collects N requests, runs them all together until every single one finishes, then accepts new requests. The GPU stalls waiting for the slowest member, and newly-arriving requests queue outside regardless of how much capacity is available.

vLLM's scheduler runs on every token step (every forward pass of the model). After each step it checks: which requests finished? Which are waiting? It evicts the finished ones and admits new ones immediately, so the batch is always as full as memory allows. This means a short request finishing mid-batch frees its KV pages and a new request slots in on the very next token step — the GPU never sees an idle cycle from finished work.
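
A small simulation makes the difference concrete. It is a deliberately simplified model: each request is just a count of tokens left to generate, one step produces one token per running request, and prefill and memory limits are ignored. The function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """After every step, finished requests leave and waiting ones join."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass yields one token for every running request.
        running = [remaining - 1 for remaining in running]
        # Drop the requests that just finished.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lengths = [100, 5, 5, 5, 100, 5, 5, 5]  # tokens to generate per request
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this toy workload the static scheduler needs 200 forward passes and the continuous one 105, because the second long request starts as soon as a short one frees a slot instead of waiting for the whole first batch.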

The prefill / decode split

Every request has two distinct phases. Prefill processes the full input prompt in one big parallel pass; it is compute-bound, using the GPU's arithmetic units intensively. Decode generates tokens one at a time; it is memory-bandwidth-bound, dominated by reading the model weights and the KV cache. vLLM supports chunked prefill, which is on by default in the V1 engine: it breaks large prompts into chunks and interleaves them with decode steps. This prevents one long-prompt request from stalling the decode steps of all other requests, keeping inter-token latency steady for concurrent users.
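
One way to picture chunked prefill is as a per-step token budget: running decodes take one token each, and whatever budget is left goes to the next slice of a waiting prompt. The sketch below assumes that policy; plan_step and token_budget are hypothetical names, and vLLM's actual scheduler is more involved.

```python
def plan_step(decoding, prefill_remaining, token_budget):
    """Split one forward pass between decode tokens and a prefill chunk.

    decoding: number of requests currently generating (one token each)
    prefill_remaining: prompt tokens a new request still has to process
    token_budget: assumed cap on tokens handled per forward pass
    """
    decode_tokens = min(decoding, token_budget)
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_chunk


# A 5 000-token prompt arrives while 32 requests are decoding.
remaining = 5000
steps = 0
while remaining > 0:
    decode_tokens, chunk = plan_step(32, remaining, token_budget=2048)
    remaining -= chunk
    steps += 1

print(steps)
```

The long prompt is absorbed over three steps, and all 32 decoding requests still receive a token on each of them.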

2025 features: prefix caching, speculative decoding, and more

The V1 engine and subsequent updates in 2025 added or matured several features that are now standard in production deployments.

Prefix caching (automatic KV cache reuse)

When many requests share a common prefix (a system prompt, a large retrieved document, a few-shot example block), vLLM computes that prefix's KV pages once and reuses them across every request that shares it. Subsequent calls with the same prefix skip prefill entirely for those tokens, which can cut TTFT sharply for RAG workloads and agent loops that resend the same long context repeatedly.
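
A common way to find reusable pages is to hash each full block together with everything that precedes it, so equal hashes imply an identical prefix. The sketch below assumes that scheme; PrefixCache and block_hashes are illustrative names, and a real cache would store KV pages and evict old ones rather than keep a bare set of hashes.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids):
    """Hash each full block chained with the hash of the blocks before it."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self):
        self.cached = set()

    def process(self, token_ids):
        """Return how many prompt tokens can skip prefill, then cache the prompt."""
        hashes = block_hashes(token_ids)
        hits = 0
        for h in hashes:
            if h not in self.cached:
                break  # reuse stops at the first block that differs
            hits += 1
        self.cached.update(hashes)
        return hits * BLOCK_SIZE


cache = PrefixCache()
system_prompt = list(range(64))  # 64 shared tokens = 4 full blocks
first = cache.process(system_prompt + [900, 901, 902])
second = cache.process(system_prompt + [700, 701])
print(first, second)
```

The first request pays for the whole prompt; the second reuses all 64 prefix tokens and only has to prefill its own suffix.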

FlashAttention-3 kernels

vLLM integrates FlashAttention — optimized GPU kernels that fuse attention computation to reduce memory reads and writes — as its default attention backend. FlashAttention-3, available on Hopper-generation GPUs (H100), adds asynchronous pipelining and FP8 support for further speed gains. You get this automatically; no configuration required.

Speculative decoding

Decode is slow because each token requires a full forward pass of a large model. Speculative decoding uses a tiny draft model to guess several tokens ahead, then the large model verifies all guesses in a single pass. When the guesses are correct (they typically are for predictable tokens), you produce multiple tokens for the cost of one forward pass — same output quality, meaningfully lower latency.
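
The accept-or-correct loop can be shown with toy models. This sketch covers the greedy variant only and uses plain functions in place of neural networks; speculative_step, draft, and target are illustrative names. A real engine verifies all guesses in one batched forward pass, whereas the loop here checks them one by one for clarity.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding over toy next-token functions."""
    # 1. The cheap draft model guesses k tokens autoregressively.
    guesses, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        guesses.append(token)
        ctx.append(token)

    # 2. The target model checks each position in order.
    accepted, ctx = [], list(prefix)
    for guess in guesses:
        expected = target_next(ctx)
        if guess != expected:
            accepted.append(expected)  # keep the target's token and stop
            break
        accepted.append(guess)
        ctx.append(guess)
    return accepted


def target(ctx):
    # Toy "large model": always counts upward by one.
    return ctx[-1] + 1


def draft(ctx):
    # Toy "draft model": agrees with the target except right after a multiple of 5.
    return ctx[-1] + 1 if ctx[-1] % 5 else ctx[-1] + 2


print(speculative_step([1], draft, target))
print(speculative_step([4], draft, target))
```

Every accepted token is one the target would have produced itself, which is why output quality is unchanged; the saving comes from rounds like the first, where four tokens are confirmed at once.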

Multi-GPU tensor parallelism

Models too large for one GPU can be sharded across several using tensor parallelism: each GPU holds a slice of every weight matrix and they cooperate on every forward pass. vLLM exposes this via the --tensor-parallel-size flag. A single vllm serve command can coordinate an 8-GPU node and present it as one API endpoint.
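
The idea of "each GPU holds a slice of every weight matrix" can be checked with plain lists. This is a sketch of column-wise sharding only, with simulated devices and made-up helper names; real tensor parallelism also shards other layers row-wise and communicates over NCCL.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Give each simulated GPU an equal slice of the matrix's columns."""
    cols = len(w[0]) // parts
    return [[row[p * cols:(p + 1) * cols] for row in w] for p in range(parts)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, parts=2)                 # each "GPU" holds half the columns
partials = [matmul(x, shard) for shard in shards]  # computed independently
combined = [v for part in partials for v in part]  # all-gather: concatenate results

print(combined)
```

Each device stores half the weights and does half the arithmetic, and the concatenated result matches the unsharded product.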

Deploying vLLM as an OpenAI-compatible server

vLLM's practical appeal is that it drops in as a local replacement for any OpenAI-compatible client. Install it, run one command, and any application using the standard OpenAI SDK works once you point its base URL at your server.

bash
# Install vLLM (requires Python 3.9+, CUDA GPU)
pip install vllm

# Serve a model — downloads from Hugging Face on first run
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# The server now listens on http://localhost:8000/v1

The --gpu-memory-utilization flag (default 0.90) tells vLLM what fraction of the GPU's VRAM this server instance may use in total. Model weights and activation workspace come out of that budget first, and whatever remains is pre-allocated as the KV cache pool. Raising the value leaves more room for KV pages and therefore more concurrent users; lowering it leaves headroom for other processes sharing the GPU.

python
from openai import OpenAI

# Point the standard OpenAI client at your local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",  # vLLM ignores this by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    temperature=0.7,
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Send ten copies of this script simultaneously and vLLM batches all ten through the GPU together — PagedAttention allocates KV pages for each independently, continuous batching keeps every slot filled, and streaming tokens flow back to each client as they're generated. You wrote zero batching code; vLLM handles it entirely.

Useful launch flags

  • --tensor-parallel-size N: split the model across N GPUs on one node
  • --max-model-len 4096: cap the context length (prompt plus output) per request; shorter = more concurrent users
  • --quantization awq (or gptq, fp8): load a quantized checkpoint to cut weight memory, roughly by half for FP8 and by more for 4-bit AWQ/GPTQ
  • --enable-prefix-caching: turn on automatic KV page reuse for shared prefixes
  • --max-num-seqs 256: limit concurrent sequences; prevents OOM under traffic spikes
  • --served-model-name my-llm: set the model name the API exposes to clients (useful behind a proxy)
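
If you launch vLLM from a deployment script, it can help to keep these flags in one place. A small sketch that assembles the command line from keyword options; build_serve_command is a hypothetical helper, and it only formats the flags listed above without validating them:

```python
import shlex


def build_serve_command(model, **options):
    """Turn keyword options into a `vllm serve` argument list.

    Underscores become dashes, True becomes a bare flag,
    and None or False values are skipped.
    """
    argv = ["vllm", "serve", model]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    return argv


argv = build_serve_command(
    "meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    max_model_len=4096,
    enable_prefix_caching=True,
    max_num_seqs=256,
    served_model_name="my-llm",
)
print(shlex.join(argv))
```

The resulting list can be passed straight to subprocess.run, which avoids shell-quoting problems with model names and paths.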

Going deeper

Once you have vLLM running, there are several directions to explore depending on your workload.

Throughput vs. latency: the fundamental trade-off

Larger batches improve throughput (more total tokens per second) but increase time to first token for individual requests, because the prefill step of a new arrival has to wait its turn in a crowded batch. The right operating point depends on your use case: a chatbot needs low TTFT; an offline document-processing pipeline wants maximum throughput. vLLM exposes --max-num-batched-tokens and scheduling knobs to tune this. There is no universal answer — benchmark with your actual traffic shape.

KV cache as the new bottleneck

As context windows grow toward millions of tokens, the KV cache, not the model weights, becomes the dominant VRAM consumer. A single 128K-token context on a 70B model needs roughly 40 GB of 16-bit KV cache even with grouped-query attention, and several times that without it. This is why KV-cache compression, quantizing the KV cache to INT8 or FP8, and offloading cold pages to CPU memory are active research areas, and why vLLM supports FP8 KV cache quantization on H100s.

vLLM vs. SGLang vs. TensorRT-LLM

vLLM is not the only option. SGLang (also from Berkeley) optimizes specifically for structured generation and multi-call agent programs, often outperforming vLLM for complex prompt flows. TensorRT-LLM (NVIDIA) applies aggressive CUDA kernel fusion and INT8/INT4 quantization and is often the fastest choice on NVIDIA hardware with a supported model, but it requires more setup. TGI (Hugging Face) is simpler and integrates tightly with HF model cards. The honest answer: run a benchmark with your model, your traffic shape, and your GPU before committing. vLLM is the default because it is the most flexible and widely documented, not because it wins every microbenchmark.

Production realities

A single vLLM process is an engine, not a system. Real deployments layer on autoscaling (spinning GPU replicas up with traffic and down when idle), a load balancer that routes requests across replicas, request queuing to absorb spikes, and observability tooling to track TTFT, throughput, and error rates. That operational layer is what LLMOps covers. vLLM's OpenAI-compatible API makes this straightforward: put any HTTP proxy in front of it and replicas are transparent to clients.

FAQ

What is vLLM and what is it used for?

vLLM is an open-source LLM inference and serving engine created at UC Berkeley. It is used to run open-weights language models in production, serving many concurrent users from a single GPU process. It is most commonly deployed as an OpenAI-compatible HTTP server, so existing applications can switch from a hosted API to a self-hosted model by changing one URL.

What is PagedAttention in vLLM?

PagedAttention is vLLM's memory management algorithm for the KV cache. Instead of reserving a large contiguous block of GPU memory per request, it breaks each request's KV cache into small fixed-size pages (typically 16 tokens) and allocates them on demand, like virtual memory in an OS. This reduces wasted GPU memory from 60–80% to under 4%, allowing far more concurrent requests on the same hardware.

How much faster is vLLM than HuggingFace Transformers?

In the benchmarks published with vLLM's original launch, vLLM achieved up to 24× the throughput of a naive HuggingFace Transformers setup and up to 3.5× the throughput of HuggingFace's own Text Generation Inference server. Real-world gains depend heavily on your traffic pattern, model, and hardware, though 4–10× improvements over naive serving under realistic concurrent load are commonly reported.

Does vLLM work with any model?

vLLM supports most major open-weight model architectures available on Hugging Face, including Llama, Mistral, Qwen, Falcon, Gemma, Phi, and many others. It does not support every model automatically — check the vLLM documentation's list of supported models before choosing. Very new model architectures typically gain support within a few weeks of release.

What is the difference between vLLM and Ollama?

Ollama is optimized for a single user running a model easily on their own machine, including laptops and CPUs. vLLM is a production inference server optimized for high concurrency on server-grade GPUs. Ollama is easier to set up and handles one or two users well; vLLM is harder to configure but serves hundreds of concurrent users efficiently from the same hardware via continuous batching and PagedAttention.

What GPU do I need to run vLLM?

vLLM is built primarily for NVIDIA GPUs with CUDA support (CUDA 11.8+). It works on any modern NVIDIA card, but practically: a 7–8B model needs about 16 GB of VRAM for its weights alone in BF16, so a 24 GB card such as an A10G or 3090 is a realistic minimum once the KV cache is included. Larger models require more VRAM, and vLLM's tensor parallelism lets you spread a model across multiple GPUs. Support for other hardware, such as AMD ROCm and Apple Silicon, exists but is less mature.

Further reading