AI/TLDR

LMDeploy

Compress, deploy, and serve large language models with the TurboMind and PyTorch engines

Overview

LMDeploy is an open-source toolkit for compressing, deploying, and serving large language models. It is developed by the MMRazor and MMDeploy teams and is built around two inference backends: TurboMind, a C++/CUDA engine, and a pure-Python PyTorch engine that is easier to extend.

It is aimed at engineers who need to run open models such as Llama, Qwen, InternLM, DeepSeek, and many vision-language models in production rather than just in a notebook. You can load Hugging Face models directly, quantize them to lower precision, and expose them over an OpenAI-compatible HTTP API.

Within the high-throughput serving category, LMDeploy focuses on combining quantization (weight-only and KV cache) with serving features like continuous batching and a blocked KV cache, so the same toolkit handles both shrinking the model and serving requests at scale.

What it does

  • Two inference engines: the C++ TurboMind engine and a pure-Python PyTorch engine for easier experimentation
  • Request throughput up to 1.8x higher than vLLM via persistent (continuous) batching, blocked KV cache, dynamic split and fuse, and tensor parallelism
  • Weight-only and KV cache quantization, including 4-bit AWQ inference reported at 2.4x faster than FP16
  • OpenAI-compatible API server so existing OpenAI client code can talk to local models
  • Support for many LLMs and vision-language models (Llama, Qwen, InternLM, DeepSeek, InternVL, and more), loadable directly from Hugging Face
  • Multi-model, multi-machine, multi-card serving through a request distribution service

Getting started

Install LMDeploy from PyPI, run a quick offline inference check, then launch the API server when you are ready to serve over HTTP. Prebuilt wheels from v0.13.0 onward are built against CUDA 12.8, and Python 3.10-3.13 is supported.

Install LMDeploy

Install the package from PyPI. This pulls in the prebuilt wheel with the TurboMind engine.

bashbash
pip install lmdeploy

Run offline inference

Use the pipeline API to load a model from Hugging Face and run a couple of prompts.

pythonpython
import lmdeploy
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)

Serve an OpenAI-compatible API

Launch the API server to expose the model over HTTP. By default it listens on localhost at port 23333; use --server-port to change it.

bashbash
lmdeploy serve api_server internlm/internlm2_5-7b-chat

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Serving an open LLM or VLM behind an OpenAI-compatible endpoint so existing client code can call a self-hosted model
  • Quantizing a model to 4-bit (AWQ) or applying KV cache quantization to fit it on smaller or fewer GPUs
  • Running high-throughput batch inference where continuous batching and a blocked KV cache improve requests per second
  • Deploying multi-model services across multiple machines and GPUs through the request distribution service

How LMDeploy compares

LMDeploy alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

ToolStarsWhat it does
Transformers★ 162kHugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM★ 83.4kA high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang★ 29.3kA serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM★ 13.9kNVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM★ 12.4kA tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server★ 10.8kA multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO★ 10.4kAn open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMDeploy★ 7.9kCompress, deploy, and serve large language models with the TurboMind and PyTorch engines