SGLang

Fast serving for LLMs and multimodal models with prefix-cache reuse

github.com/sgl-project/sglang · ★ 29.3k · docs.sglang.ai

Overview

SGLang is a serving framework for large language models and multimodal models. It runs your model behind an HTTP server and answers requests through an OpenAI-compatible API, so existing clients can talk to it with little change. It is built to give low-latency, high-throughput inference from a single GPU up to large distributed clusters.

Its main idea is to reuse work across requests. RadixAttention caches shared prompt prefixes, so when many requests start with the same system prompt or few-shot examples, the engine does not recompute that part each time. Combined with continuous batching and a low-overhead scheduler, this raises the number of tokens a server can produce per second.
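
The prefix-reuse idea can be illustrated with a toy cache. This is a minimal sketch, not SGLang's implementation: it uses a plain token-level trie rather than a compressed radix tree, works on word strings instead of real token IDs, and stores no attention state. The names PrefixCache and serve are hypothetical.

```python
class PrefixCache:
    """Toy token-level trie. Each node stands for one already-computed token."""

    def __init__(self):
        self.root = {}

    def match(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record the full token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


def serve(cache, tokens):
    """Return (reused, computed) token counts for one request."""
    reused = cache.match(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused
```

If two requests share a four-token system prompt, the first one computes every token and the second one only computes the tokens after the shared prefix.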

SGLang belongs to the category of high-throughput serving engines that sit in front of a model in production. It supports a wide range of language models (Llama, Qwen, DeepSeek, GLM, Gemma, Mistral, and more) as well as embedding, reward, and diffusion models, and it offers quantization, tensor and pipeline parallelism, structured outputs, and multi-LoRA batching.

What it does

RadixAttention prefix caching reuses shared prompt prefixes across requests to cut repeated computation
OpenAI-compatible HTTP server, so existing OpenAI client code can point at your local endpoint
Continuous batching, paged attention, chunked prefill, and a zero-overhead CPU scheduler for higher throughput
Tensor, pipeline, expert, and data parallelism to scale from one GPU to multi-GPU clusters
Quantization support (FP8, FP4, INT4, AWQ, GPTQ) plus multi-LoRA batching
Broad model coverage: language, embedding, reward, and diffusion models including Llama, Qwen, DeepSeek, GLM, Gemma, and Mistral
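
To see why continuous batching in the list above helps throughput, the following toy simulation counts decode steps under two policies. It is a sketch under simplifying assumptions (one token per request per step, a fixed number of batch slots, no prefill cost), and the function names are hypothetical rather than part of SGLang.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue at every decode step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every running request emits one token; finished ones leave the batch.
        active = [left - 1 for left in active if left - 1 > 0]
    return steps
```

With output lengths [2, 8, 2, 8] and two batch slots, the static policy needs 16 steps while the continuous policy needs 12, because short requests no longer hold a slot idle while a long neighbor finishes.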

Getting started

Install SGLang, launch a server for a model, then send requests to its OpenAI-compatible endpoint on port 30000.

Install SGLang

The docs recommend installing with uv for a faster setup.

bash

pip install --upgrade pip
pip install uv
uv pip install sglang

Launch the server

Start the server with a model path. By default it listens on port 30000.

bash

python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \
  --host 0.0.0.0 --log-level warning
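
Loading a model can take a while, so a client script may want to wait until the server answers before sending traffic. The sketch below polls a URL until it returns HTTP 200. It assumes the server exposes the OpenAI-style model listing at http://localhost:30000/v1/models; check the official docs for the endpoints your version provides. The helper names http_ok and wait_until_ready are hypothetical.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(url, timeout_s=120.0, interval_s=1.0, probe=http_ok):
    """Poll `url` with `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False
```

A typical call would be wait_until_ready("http://localhost:30000/v1/models"), which returns False if the server has not come up within the timeout.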

Send a request

Query the OpenAI-compatible chat endpoint with curl. The base URL is http://localhost:30000/v1.

bash

curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen/qwen2.5-0.5b-instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
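
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; it assumes the server from the previous step is running locally, and the helper names build_chat_request and ask are hypothetical. The response parsing follows the standard OpenAI chat completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000/v1"
MODEL = "qwen/qwen2.5-0.5b-instruct"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, ask("What is the capital of France?") returns the model's answer as a string. Existing OpenAI client libraries can be pointed at the same base URL instead.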

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Serve an open model behind an OpenAI-compatible API so existing app code works without rewriting client calls
Run chat or agent workloads where many requests share a long system prompt, letting prefix caching skip repeated work
Scale inference across multiple GPUs with tensor or pipeline parallelism for large models like DeepSeek or Qwen
Serve multiple LoRA adapters in one batch, or run quantized models to fit larger workloads on limited GPU memory
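
A rough calculation shows why quantization matters on limited GPU memory. The sketch below estimates the memory taken by model weights alone at different precisions. It ignores the KV cache, activations, and runtime overhead, so real usage will be higher. FP16 is included as an unquantized baseline, and the 7-billion-parameter size is a hypothetical example.

```python
# Bits stored per weight for each format (FP16 shown as the baseline).
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4}


def weight_memory_gib(num_params, bits):
    """Approximate weight memory in GiB: params * bits / 8 bytes."""
    return num_params * bits / 8 / 2**30


def summarize(num_params):
    """Return {format: GiB} for every format in BITS_PER_WEIGHT."""
    return {
        name: round(weight_memory_gib(num_params, bits), 2)
        for name, bits in BITS_PER_WEIGHT.items()
    }
```

For a hypothetical 7-billion-parameter model, summarize(7_000_000_000) gives roughly 13.04 GiB at FP16, 6.52 GiB at FP8, and 3.26 GiB at INT4 for the weights.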

How SGLang compares

SGLang alongside the other open-source serving and deployment tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	Fast serving for LLMs and multimodal models with prefix-cache reuse
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache	★ 9.4k	A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation.
