Overview
vLLM is a library for running and serving large language models. It loads models from Hugging Face and exposes them either as a Python API for batch generation or as an OpenAI-compatible HTTP server. It was originally developed in the Sky Computing Lab at UC Berkeley and is now maintained by a large open-source community.
It is built for teams that need to serve many requests at the same time without running out of GPU memory. Its PagedAttention technique manages the attention key/value cache more efficiently, and continuous batching keeps the GPU busy by adding and removing requests on the fly. It supports 200+ model architectures, including decoder-only LLMs, mixture-of-expert models, and multimodal models.
Within the inference and serving category, vLLM sits at the high-throughput serving end. You point it at a model, and it handles batching, memory management, streaming, and an API surface so you can focus on your application instead of the serving plumbing.
What it does
- PagedAttention manages attention key/value memory efficiently, reducing waste and fitting more concurrent requests in GPU memory
- Continuous batching, chunked prefill, and prefix caching keep throughput high under load
- OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
- Wide quantization support including FP8, INT8, INT4, GPTQ/AWQ, and GGUF
- Tensor, pipeline, data, expert, and context parallelism for distributed inference
- Runs on NVIDIA and AMD GPUs, x86/ARM/PowerPC CPUs, and other hardware via plugins
Getting started
Install vLLM, then either generate text from Python or start an OpenAI-compatible server.
Install vLLM
Install with uv (recommended) or pip. A GPU with a matching CUDA setup is the common target.
uv pip install vllmRun offline batched inference
Load a model and generate text for a list of prompts directly in Python.
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(output.prompt, output.outputs[0].text)Start an OpenAI-compatible server
Serve a model over HTTP at http://localhost:8000 with chat and completion endpoints that match the OpenAI API.
vllm serve Qwen/Qwen2.5-1.5B-InstructCommands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Serving a chat or completion model to many users at once behind an OpenAI-compatible API
- Running high-volume offline batch generation over large prompt sets
- Hosting open-weight Hugging Face models on your own GPUs instead of a paid API
- Scaling a single large model across multiple GPUs with tensor or pipeline parallelism
How vLLM compares
vLLM alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Transformers | ★ 162k | Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training. |
| vLLM | ★ 83.4k | Fast, memory-efficient inference and serving for large language models |
| SGLang | ★ 29.3k | A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests. |
| TensorRT-LLM | ★ 13.9k | NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs. |
| OpenLLM | ★ 12.4k | A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud. |
| NVIDIA Triton Inference Server | ★ 10.8k | A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution. |
| OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware. |
| LMCache | ★ 9.4k | A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation. |