vLLM

Fast, memory-efficient inference and serving for large language models

github.com/vllm-project/vllm★ 83.4k docs.vllm.ai

Overview

vLLM is a library for running and serving large language models. It loads models from Hugging Face and exposes them either as a Python API for batch generation or as an OpenAI-compatible HTTP server. It was originally developed in the Sky Computing Lab at UC Berkeley and is now maintained by a large open-source community.

It is built for teams that need to serve many requests at the same time without running out of GPU memory. Its PagedAttention technique manages the attention key/value cache more efficiently, and continuous batching keeps the GPU busy by adding and removing requests on the fly. It supports 200+ model architectures, including decoder-only LLMs, mixture-of-expert models, and multimodal models.

Within the inference and serving category, vLLM sits at the high-throughput serving end. You point it at a model, and it handles batching, memory management, streaming, and an API surface so you can focus on your application instead of the serving plumbing.

What it does

PagedAttention manages attention key/value memory efficiently, reducing waste and fitting more concurrent requests in GPU memory
Continuous batching, chunked prefill, and prefix caching keep throughput high under load
OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
Wide quantization support including FP8, INT8, INT4, GPTQ/AWQ, and GGUF
Tensor, pipeline, data, expert, and context parallelism for distributed inference
Runs on NVIDIA and AMD GPUs, x86/ARM/PowerPC CPUs, and other hardware via plugins

Getting started

Install vLLM, then either generate text from Python or start an OpenAI-compatible server.

Install vLLM

Install with uv (recommended) or pip. A GPU with a matching CUDA setup is the common target.

bashbash

uv pip install vllm

Run offline batched inference

Load a model and generate text for a list of prompts directly in Python.

pythonpython

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)

Start an OpenAI-compatible server

Serve a model over HTTP at http://localhost:8000 with chat and completion endpoints that match the OpenAI API.

bashbash

vllm serve Qwen/Qwen2.5-1.5B-Instruct

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Serving a chat or completion model to many users at once behind an OpenAI-compatible API
Running high-volume offline batch generation over large prompt sets
Hosting open-weight Hugging Face models on your own GPUs instead of a paid API
Scaling a single large model across multiple GPUs with tensor or pipeline parallelism

How vLLM compares

vLLM alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	Fast, memory-efficient inference and serving for large language models
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache	★ 9.4k	A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation.

// Overview

// What it does

// Getting started