LMCache

A KV cache layer that stores and reuses attention state to cut repeated LLM prefill

github.com/LMCache/LMCache  ★ 9.4k  lmcache.ai

Overview

LMCache is a KV cache management layer for LLM inference. Instead of treating the key-value attention cache as throwaway state, it stores that cache so it can be reused across requests, sessions, and serving engines. By skipping prefill computation that has already been done, it reduces time-to-first-token (TTFT) and raises throughput.

It is aimed at teams running production LLM serving who handle long-context, multi-turn, agentic, or RAG workloads where the same context shows up again and again. It runs as a standalone daemon, so cached KV data survives even if the inference engine process crashes.

LMCache is vendor-neutral. It plugs into mainstream open-source serving engines (such as vLLM) and a range of storage backends, so you can move KV cache out of GPU memory into CPU RAM, local disk, or remote stores, and switch vendors without losing your cached data.

What it does

Engine-independent daemon: manages KV cache in a separate process, so the cache is not lost if the inference engine crashes
Tiered KV cache offloading: moves cache from GPU memory into CPU RAM, local SSD, and remote backends for reuse across requests and instances
Pluggable storage backends: CPU RAM, local disk, Redis/Valkey, Mooncake, InfiniStore, S3-compatible object storage, NIXL, and GPUDirect Storage (GDS)
Non-prefix KV reuse via CacheBlend: reuse cached blocks at any position in the prompt, not just shared prefixes
Prefill/decode (PD) disaggregation: transfers KV cache from prefill workers to decode workers over NVLink, RDMA, or TCP
KV cache observability: request- and token-level prefix cache hit metrics, cache lifecycle tracking, and Kubernetes-style health and performance metrics

Getting started

Install LMCache with pip, then run it as a KV cache layer alongside a serving engine such as vLLM. The example below uses the recommended multiprocess (MP) mode.

Install LMCache

Install the lmcache package with pip. The docs recommend installing it alongside vLLM in a fresh Python 3.12 virtual environment.

bash

pip install lmcache

Start the LMCache server

Launch the standalone LMCache server, which manages the KV cache independently of the inference engine.

bash

lmcache server \
    --l1-size-gb 20 --eviction-policy LRU --chunk-size 16
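To get a feel for what `--l1-size-gb 20` and `--chunk-size 16` mean in practice, the arithmetic below estimates how many tokens of KV cache fit in a 20 GB tier. It is a rough sketch, not a formula from the LMCache docs. It assumes the tier holds raw, uncompressed KV tensors with no metadata overhead, that "GB" is binary (GiB), and that the model shape matches Qwen3-8B (36 layers, 8 KV heads, head dimension 128, 2-byte bf16 elements). Check the model's `config.json` before relying on these numbers.

```bash
#!/bin/sh
# Sketch: estimate how many tokens of KV cache fit in a cache tier.
set -eu

# KV bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
kv_bytes_per_token() {
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

# Tokens that fit in a tier of the given size in GiB (assumes no overhead).
tokens_in_tier() {
  echo $(( $1 * 1024 * 1024 * 1024 / $2 ))
}

# Assumed model shape: 36 layers, 8 KV heads, head_dim 128, bf16 (2 bytes).
per_token="$(kv_bytes_per_token 36 8 128 2)"
tokens="$(tokens_in_tier 20 "$per_token")"
chunks=$(( tokens / 16 ))   # whole 16-token chunks, matching --chunk-size 16

echo "KV bytes per token: $per_token"   # 147456 (144 KiB)
echo "tokens in 20 GiB:   $tokens"      # 145635
echo "16-token chunks:    $chunks"      # 9102
```

Under these assumptions a 20 GB tier holds roughly 145k tokens of context, which helps when deciding whether a CPU tier alone is enough or a disk or remote backend is worth adding.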

Serve a model with the LMCache connector

Start vLLM and point it at LMCache through the MP connector using the kv-transfer-config flag.

bash

vllm serve Qwen/Qwen3-8B \
    --port 8000 --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
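Model loading can take a while, so it may help to wait until the server answers before sending traffic. The sketch below is a generic retry helper, not part of LMCache or vLLM. It assumes vLLM's OpenAI-compatible server is on port 8000 as configured above and polls its `/v1/models` endpoint. The helper name `wait_until` and the retry counts are illustrative.

```bash
#!/bin/sh
# Sketch: retry a command until it succeeds or attempts run out.
set -eu

BASE_URL="${BASE_URL:-http://localhost:8000}"   # port taken from the vllm serve command

# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  while [ "$attempts" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Poll only when invoked as: sh wait_ready.sh run
if [ "${1:-}" = "run" ]; then
  if wait_until 60 5 curl -sf -o /dev/null "$BASE_URL/v1/models"; then
    echo "server is ready"
  else
    echo "server did not become ready in time" >&2
    exit 1
  fi
fi
```

Saved under a hypothetical name such as `wait_ready.sh`, this can be run with `sh wait_ready.sh run` between starting vLLM and sending the first request.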

Send a request

Call the OpenAI-compatible completions endpoint; repeated context will hit the cached KV state.

bash

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-8B", "prompt": "Your prompt", "max_tokens": 100}'
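One informal way to see whether repeated context is being reused is to send the same long prompt twice and compare request times. The script below is a sketch of that idea and is not taken from the project's documentation. It assumes the server from the previous step is on `localhost:8000`, uses curl's `time_total` as a stand-in for TTFT by requesting a single output token, and builds the prompt from a repeated sentence that needs no JSON escaping. A faster second request may also come from the engine's own GPU prefix cache, so this does not isolate LMCache's contribution.

```bash
#!/bin/sh
# Sketch: send the same long prompt twice and compare total request time.
set -eu

BASE_URL="${BASE_URL:-http://localhost:8000}"   # assumed from the steps above
MODEL="${MODEL:-Qwen/Qwen3-8B}"

# Repeat one sentence N times. It contains no quotes or backslashes,
# so it can be embedded in JSON without escaping.
make_prompt() {
  n="$1"
  out=""
  while [ "$n" -gt 0 ]; do
    out="${out}LMCache stores KV cache so repeated context can skip prefill. "
    n=$((n - 1))
  done
  printf '%s' "$out"
}

# Build the /v1/completions body. max_tokens=1 keeps decode time small,
# so the measurement is dominated by prefill.
make_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": 1}' "$MODEL" "$1"
}

# Print the total request time in seconds, as reported by curl.
time_request() {
  curl -s -o /dev/null -w '%{time_total}\n' \
    "$BASE_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$1"
}

# Send requests only when invoked as: sh warm_vs_cold.sh run
if [ "${1:-}" = "run" ]; then
  payload="$(make_payload "$(make_prompt 200)")"
  echo "first request:  $(time_request "$payload") s"
  echo "second request: $(time_request "$payload") s"
fi
```

To check the CPU tier specifically, one option is to restart the vLLM process between the two requests while leaving the LMCache server running, so the engine's own cache starts empty.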

Commands and code are distilled from the project's own documentation; always check the official repo for the latest versions.

When to use it

Long-context, multi-turn chat where earlier conversation turns are re-sent on every request and would otherwise be re-prefilled
RAG and knowledge-augmented workloads where the same retrieved documents recur across queries
Agentic workloads that repeatedly feed long shared context, tools, or system prompts into the model
Disaggregated prefill/decode deployments that need to transfer KV cache between workers over NVLink, RDMA, or TCP
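The multi-turn case above can be made concrete with a little arithmetic. If each turn adds k new tokens and the full history is re-sent, turn i prefills i·k tokens without reuse, for a total of k·T·(T+1)/2 over T turns. With ideal reuse, only the k new tokens per turn are prefilled. The sketch below uses hypothetical values (10 turns, 500 tokens per turn) and assumes every earlier token hits the cache, ignoring chunk granularity and evictions.

```bash
#!/bin/sh
# Sketch: prefill work in a multi-turn chat, with and without KV reuse.
set -eu

# Every turn recomputes the full history:
# k + 2k + ... + T*k = k * T * (T + 1) / 2
prefill_without_reuse() {
  echo $(( $2 * $1 * ($1 + 1) / 2 ))
}

# Earlier turns come from cache, so each turn prefills only its k new tokens.
prefill_with_reuse() {
  echo $(( $2 * $1 ))
}

turns=10              # hypothetical conversation length
tokens_per_turn=500   # hypothetical new tokens added per turn

without="$(prefill_without_reuse "$turns" "$tokens_per_turn")"
with="$(prefill_with_reuse "$turns" "$tokens_per_turn")"
saved_pct=$(( (without - with) * 100 / without ))

echo "prefilled without reuse: $without tokens"   # 27500
echo "prefilled with reuse:    $with tokens"      # 5000
echo "prefill avoided:         ${saved_pct}%"     # 81%
```

Because the cost without reuse grows quadratically with the number of turns, the share of avoidable prefill rises as conversations get longer.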

How LMCache compares

LMCache alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache	★ 9.4k	A KV cache layer that stores and reuses attention state to cut repeated LLM prefill
