AI/TLDR

vLLM

Fast, memory-efficient inference and serving for large language models

Overview

vLLM is a library for running and serving large language models. It loads models from Hugging Face and exposes them either as a Python API for batch generation or as an OpenAI-compatible HTTP server. It was originally developed in the Sky Computing Lab at UC Berkeley and is now maintained by a large open-source community.

It is built for teams that need to serve many requests at the same time without running out of GPU memory. Its PagedAttention technique manages the attention key/value cache more efficiently, and continuous batching keeps the GPU busy by adding and removing requests on the fly. It supports 200+ model architectures, including decoder-only LLMs, mixture-of-expert models, and multimodal models.

Within the inference and serving category, vLLM sits at the high-throughput serving end. You point it at a model, and it handles batching, memory management, streaming, and an API surface so you can focus on your application instead of the serving plumbing.

What it does

  • PagedAttention manages attention key/value memory efficiently, reducing waste and fitting more concurrent requests in GPU memory
  • Continuous batching, chunked prefill, and prefix caching keep throughput high under load
  • OpenAI-compatible API server, plus Anthropic Messages API and gRPC support
  • Wide quantization support including FP8, INT8, INT4, GPTQ/AWQ, and GGUF
  • Tensor, pipeline, data, expert, and context parallelism for distributed inference
  • Runs on NVIDIA and AMD GPUs, x86/ARM/PowerPC CPUs, and other hardware via plugins

Getting started

Install vLLM, then either generate text from Python or start an OpenAI-compatible server.

Install vLLM

Install with uv (recommended) or pip. A GPU with a matching CUDA setup is the common target.

bashbash
uv pip install vllm

Run offline batched inference

Load a model and generate text for a list of prompts directly in Python.

pythonpython
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)

Start an OpenAI-compatible server

Serve a model over HTTP at http://localhost:8000 with chat and completion endpoints that match the OpenAI API.

bashbash
vllm serve Qwen/Qwen2.5-1.5B-Instruct

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Serving a chat or completion model to many users at once behind an OpenAI-compatible API
  • Running high-volume offline batch generation over large prompt sets
  • Hosting open-weight Hugging Face models on your own GPUs instead of a paid API
  • Scaling a single large model across multiple GPUs with tensor or pipeline parallelism

How vLLM compares

vLLM alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

ToolStarsWhat it does
Transformers★ 162kHugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM★ 83.4kFast, memory-efficient inference and serving for large language models
SGLang★ 29.3kA serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM★ 13.9kNVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM★ 12.4kA tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server★ 10.8kA multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO★ 10.4kAn open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache★ 9.4kA KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation.