Overview
SGLang is a serving framework for large language models and multimodal models. It runs your model behind an HTTP server and answers requests through an OpenAI-compatible API, so existing clients can talk to it with little change. It is built to give low-latency, high-throughput inference from a single GPU up to large distributed clusters.
Its main idea is to reuse work across requests. RadixAttention caches shared prompt prefixes, so when many requests start with the same system prompt or few-shot examples, the engine does not recompute that part each time. Combined with continuous batching and a low-overhead scheduler, this raises the number of tokens a server can produce per second.
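To make the prefix-reuse idea concrete, here is a toy sketch in Python. It is not SGLang's implementation: the `PrefixCache` class and its `match_and_insert` method are hypothetical names, words stand in for token IDs, and a plain nested dictionary stands in for a radix tree. A real engine also stores attention state for each cached token and evicts old entries. The sketch only shows how a shared prefix is detected so that the matching tokens need not be recomputed.

```python
class PrefixCache:
    """Toy token trie (hypothetical, for illustration only)."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        matched = 0
        for tok in tokens:
            # After the first miss we descend into freshly created empty nodes,
            # so only a leading run of tokens can ever count as matched.
            if tok in node:
                matched += 1
            node = node.setdefault(tok, {})
        return matched


system_prompt = ["You", "are", "a", "helpful", "assistant"]
cache = PrefixCache()
print(cache.match_and_insert(system_prompt + ["Hi"]))   # 0 (cold cache)
print(cache.match_and_insert(system_prompt + ["Bye"]))  # 5 (system prompt reused)
```

In this toy run the second request only needs new work for its final token, which is the effect the paragraph above describes for shared system prompts and few-shot examples.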
It belongs to the high-throughput serving category: engines that sit between a model and production traffic. It supports a wide range of language models (Llama, Qwen, DeepSeek, GLM, Gemma, Mistral and more), embedding models, reward models, and diffusion models, and offers quantization, tensor and pipeline parallelism, structured outputs, and multi-LoRA batching.
What it does
- RadixAttention prefix caching reuses shared prompt prefixes across requests to cut repeated computation
- OpenAI-compatible HTTP server, so existing OpenAI client code can point at your local endpoint
- Continuous batching, paged attention, chunked prefill, and a zero-overhead CPU scheduler for higher throughput
- Tensor, pipeline, expert, and data parallelism to scale from one GPU to multi-GPU clusters
- Quantization support (FP8, FP4, INT4, AWQ, GPTQ) plus multi-LoRA batching
- Broad model coverage: language, embedding, reward, and diffusion models including Llama, Qwen, DeepSeek, GLM, Gemma, and Mistral
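The throughput benefit of continuous batching, listed above, can be illustrated with a small simulation. This is a simplified model, not SGLang's scheduler: the function names are hypothetical, each request is reduced to the number of decode steps it needs (a positive integer), and every step is assumed to cost the same regardless of batch size.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed groups: each group runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Refill a free slot as soon as a request finishes, one decode step at a time."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each running request emits one token; finished requests leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lengths = [8, 2, 2, 2, 2]  # decode steps needed per request (made-up values)
print(static_batching_steps(lengths, batch_size=2))      # 12
print(continuous_batching_steps(lengths, batch_size=2))  # 8
```

With static batching the short requests wait behind the long one in their group; with continuous batching they flow through the second slot while the long request is still decoding.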
Getting started
Install SGLang, launch a server for a model, then send requests to its OpenAI-compatible endpoint on port 30000.
Install SGLang
The docs recommend installing with uv for a faster setup.
pip install --upgrade pip
pip install uv
uv pip install sglang
Launch the server
Start the server with a model path. By default it listens on port 30000.
python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \
--host 0.0.0.0 --log-level warning
Send a request
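Loading a model can take a while, so a request sent immediately after launch may be refused. The following is a minimal sketch that polls the port until it accepts TCP connections. The `wait_for_port` helper is a hypothetical name, not part of SGLang, and an open port is only a rough readiness signal: it does not by itself prove the model has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=30000, timeout=120.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` with its defaults checks the default port 30000 on the local machine for up to two minutes.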
Query the OpenAI-compatible chat endpoint with curl. The base URL is http://localhost:30000/v1.
curl -s http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen/qwen2.5-0.5b-instruct", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
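The same curl request can be made from Python. This sketch uses only the standard library and mirrors the URL, header, and JSON body of the curl call. The helper names `build_chat_request` and `ask` are illustrative, and the response parsing assumes the server returns the standard OpenAI chat-completions shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000/v1"
MODEL = "qwen/qwen2.5-0.5b-instruct"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build the POST request that the curl command sends."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response: choices[0].message.content holds the answer.
    return body["choices"][0]["message"]["content"]
```

With the server from the previous step running, `print(ask("What is the capital of France?"))` should print the model's answer.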
When to use it
- Serve an open model behind an OpenAI-compatible API so existing app code works without rewriting client calls
- Run chat or agent workloads where many requests share a long system prompt, letting prefix caching skip repeated work
- Scale inference across multiple GPUs with tensor or pipeline parallelism for large models like DeepSeek or Qwen
- Serve multiple LoRA adapters in one batch, or run quantized models to fit larger workloads on limited GPU memory
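To see why quantization helps on limited GPU memory, a rough estimate of weight storage is parameters × bits per weight ÷ 8 bytes. The sketch below applies that to a hypothetical 7-billion-parameter model; FP16 is included only as an unquantized baseline. It counts weights alone and ignores the KV cache, activations, and the small overhead of quantization scales, so real usage will be higher.

```python
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "INT4": 4}


def weight_memory_gb(num_params, bits):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits / 8 / 1e9


NUM_PARAMS = 7e9  # hypothetical model size, chosen for illustration
for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# FP16: 14.0 GB
# FP8: 7.0 GB
# INT4: 3.5 GB
```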
How SGLang compares
SGLang alongside the other open-source serving and deployment tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Transformers | ★ 162k | Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training. |
| vLLM | ★ 83.4k | A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once. |
| SGLang | ★ 29.3k | Fast serving for LLMs and multimodal models with prefix-cache reuse. |
| TensorRT-LLM | ★ 13.9k | NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs. |
| OpenLLM | ★ 12.4k | A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud. |
| NVIDIA Triton Inference Server | ★ 10.8k | A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution. |
| OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware. |
| LMCache | ★ 9.4k | A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation. |