LMDeploy

Compress, deploy, and serve large language models with the TurboMind and PyTorch engines

github.com/InternLM/lmdeploy★ 7.9k lmdeploy.readthedocs.io

Overview

LMDeploy is an open-source toolkit for compressing, deploying, and serving large language models. It is developed by the MMRazor and MMDeploy teams and is built around two inference backends: TurboMind, a C++/CUDA engine, and a pure-Python PyTorch engine that is easier to extend.

It is aimed at engineers who need to run open models such as Llama, Qwen, InternLM, DeepSeek, and many vision-language models in production rather than just in a notebook. You can load Hugging Face models directly, quantize them to lower precision, and expose them over an OpenAI-compatible HTTP API.

Within the high-throughput serving category, LMDeploy focuses on combining quantization (weight-only and KV cache) with serving features like continuous batching and a blocked KV cache, so the same toolkit handles both shrinking the model and serving requests at scale.

What it does

Two inference engines: the C++ TurboMind engine and a pure-Python PyTorch engine for easier experimentation
Request throughput up to 1.8x higher than vLLM via persistent (continuous) batching, blocked KV cache, dynamic split and fuse, and tensor parallelism
Weight-only and KV cache quantization, including 4-bit AWQ inference reported at 2.4x faster than FP16
OpenAI-compatible API server so existing OpenAI client code can talk to local models
Support for many LLMs and vision-language models (Llama, Qwen, InternLM, DeepSeek, InternVL, and more), loadable directly from Hugging Face
Multi-model, multi-machine, multi-card serving through a request distribution service

Getting started

Install LMDeploy from PyPI, run a quick offline inference check, then launch the API server when you are ready to serve over HTTP. Prebuilt wheels from v0.13.0 onward are built against CUDA 12.8, and Python 3.10-3.13 is supported.

Install LMDeploy

Install the package from PyPI. This pulls in the prebuilt wheel with the TurboMind engine.

bashbash

pip install lmdeploy

Run offline inference

Use the pipeline API to load a model from Hugging Face and run a couple of prompts.

pythonpython

import lmdeploy
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)

Serve an OpenAI-compatible API

Launch the API server to expose the model over HTTP. By default it listens on localhost at port 23333; use --server-port to change it.

bashbash

lmdeploy serve api_server internlm/internlm2_5-7b-chat

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Serving an open LLM or VLM behind an OpenAI-compatible endpoint so existing client code can call a self-hosted model
Quantizing a model to 4-bit (AWQ) or applying KV cache quantization to fit it on smaller or fewer GPUs
Running high-throughput batch inference where continuous batching and a blocked KV cache improve requests per second
Deploying multi-model services across multiple machines and GPUs through the request distribution service

How LMDeploy compares

LMDeploy alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMDeploy	★ 7.9k	Compress, deploy, and serve large language models with the TurboMind and PyTorch engines

// Overview

// What it does

// Getting started