Overview
LMDeploy is an open-source toolkit for compressing, deploying, and serving large language models. It is developed by the MMRazor and MMDeploy teams and is built around two inference backends: TurboMind, a C++/CUDA engine, and a pure-Python PyTorch engine that is easier to extend.
It is aimed at engineers who need to run open models such as Llama, Qwen, InternLM, DeepSeek, and many vision-language models in production rather than just in a notebook. You can load Hugging Face models directly, quantize them to lower precision, and expose them over an OpenAI-compatible HTTP API.
Within the high-throughput serving category, LMDeploy focuses on combining quantization (weight-only and KV cache) with serving features like continuous batching and a blocked KV cache, so the same toolkit handles both shrinking the model and serving requests at scale.
What it does
- Two inference engines: the C++ TurboMind engine and a pure-Python PyTorch engine for easier experimentation
- Request throughput up to 1.8x higher than vLLM via persistent (continuous) batching, blocked KV cache, dynamic split and fuse, and tensor parallelism
- Weight-only and KV cache quantization, including 4-bit AWQ inference reported at 2.4x faster than FP16
- OpenAI-compatible API server so existing OpenAI client code can talk to local models
- Support for many LLMs and vision-language models (Llama, Qwen, InternLM, DeepSeek, InternVL, and more), loadable directly from Hugging Face
- Multi-model, multi-machine, multi-card serving through a request distribution service
Getting started
Install LMDeploy from PyPI, run a quick offline inference check, then launch the API server when you are ready to serve over HTTP. Prebuilt wheels from v0.13.0 onward are built against CUDA 12.8, and Python 3.10-3.13 is supported.
Install LMDeploy
Install the package from PyPI. This pulls in the prebuilt wheel with the TurboMind engine.
pip install lmdeployRun offline inference
Use the pipeline API to load a model from Hugging Face and run a couple of prompts.
import lmdeploy
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)Serve an OpenAI-compatible API
Launch the API server to expose the model over HTTP. By default it listens on localhost at port 23333; use --server-port to change it.
lmdeploy serve api_server internlm/internlm2_5-7b-chatCommands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Serving an open LLM or VLM behind an OpenAI-compatible endpoint so existing client code can call a self-hosted model
- Quantizing a model to 4-bit (AWQ) or applying KV cache quantization to fit it on smaller or fewer GPUs
- Running high-throughput batch inference where continuous batching and a blocked KV cache improve requests per second
- Deploying multi-model services across multiple machines and GPUs through the request distribution service
How LMDeploy compares
LMDeploy alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Transformers | ★ 162k | Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training. |
| vLLM | ★ 83.4k | A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once. |
| SGLang | ★ 29.3k | A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests. |
| TensorRT-LLM | ★ 13.9k | NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs. |
| OpenLLM | ★ 12.4k | A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud. |
| NVIDIA Triton Inference Server | ★ 10.8k | A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution. |
| OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware. |
| LMDeploy | ★ 7.9k | Compress, deploy, and serve large language models with the TurboMind and PyTorch engines |