AI/TLDR

LMCache

A KV cache layer that stores and reuses attention state to cut repeated LLM prefill

Overview

LMCache is a KV cache management layer for LLM inference. Instead of treating the key-value attention cache as throwaway state, it stores that cache so it can be reused across requests, sessions, and serving engines. By skipping prefill computation that has already been done, it reduces time-to-first-token (TTFT) and raises throughput.
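
To make the reuse idea concrete, here is a minimal Python sketch of prefix-based KV lookup. It is an illustration under stated assumptions, not LMCache's implementation: the helper names `chunk_keys` and `count_cached_tokens`, the chained SHA-256 keys, and the chunk size of 4 are all hypothetical choices made for the example.

```python
import hashlib


def chunk_keys(token_ids: list[int], chunk_size: int) -> list[str]:
    """Return one key per full chunk; each key depends on every token before it."""
    keys = []
    prev = ""
    full = len(token_ids) - len(token_ids) % chunk_size  # ignore a trailing partial chunk
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        payload = prev + "|" + ",".join(map(str, chunk))
        prev = hashlib.sha256(payload.encode()).hexdigest()  # chain in the previous key
        keys.append(prev)
    return keys


def count_cached_tokens(token_ids: list[int], cache: set[str], chunk_size: int) -> int:
    """Count leading tokens whose KV state is already cached (prefill can skip them)."""
    hits = 0
    for key in chunk_keys(token_ids, chunk_size):
        if key not in cache:
            break  # prefix reuse stops at the first missing chunk
        hits += 1
    return hits * chunk_size


# First request: 10 tokens, so two full chunks of 4 get cached.
first = list(range(1, 11))
cache = set(chunk_keys(first, 4))

# Second request shares the first 8 tokens, then diverges.
second = first[:8] + [99, 100, 101, 102]
reusable = count_cached_tokens(second, cache, 4)
```

In this toy model only the tokens after the reusable prefix would need prefill on the second request. A real cache stores the attention tensors behind each key; the set here only tracks which chunks exist.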

It is aimed at teams running production LLM serving who handle long-context, multi-turn, agentic, or RAG workloads where the same context shows up again and again. It runs as a standalone daemon, so cached KV data survives even if the inference engine process crashes.

LMCache is vendor-neutral. It plugs into mainstream open-source serving engines (such as vLLM) and a range of storage backends, so you can move KV cache out of GPU memory into CPU RAM, local disk, or remote stores, and switch vendors without losing your cached data.

What it does

  • Engine-independent daemon: manages KV cache in a separate process, so the cache is not lost if the inference engine crashes
  • Tiered KV cache offloading: moves cache from GPU memory into CPU RAM, local SSD, and remote backends for reuse across requests and instances
  • Pluggable storage backends: CPU RAM, local disk, Redis/Valkey, Mooncake, InfiniStore, S3-compatible object storage, NIXL, and GDS
  • Non-prefix KV reuse via CacheBlend: reuse cached blocks at any position in the prompt, not just shared prefixes
  • Prefill/decode (PD) disaggregation: transfers KV cache from prefill workers to decode workers over NVLink, RDMA, or TCP
  • KV cache observability: request- and token-level prefix cache hit metrics, cache lifecycle, and Kubernetes-style health and performance metrics
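
The tiered offloading idea in the list above can be sketched in a few lines of Python. This is a toy model under assumptions, not LMCache's code: the `TieredKVStore` class is hypothetical, the "fast" tier stands in for GPU or CPU memory, the "slow" tier stands in for disk or a remote backend, and capacity is counted in entries rather than bytes.

```python
from collections import OrderedDict
from typing import Optional


class TieredKVStore:
    """Toy two-tier store: a small fast tier backed by a larger slow tier."""

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast: OrderedDict[str, bytes] = OrderedDict()  # most recently used last
        self.slow: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self.fast[key] = value
        self.fast.move_to_end(key)
        self.slow.pop(key, None)  # keep a single copy of each entry
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)  # least recently used
            self.slow[old_key] = old_value  # offload instead of discarding

    def get(self, key: str) -> Optional[bytes]:
        if key in self.fast:
            self.fast.move_to_end(key)  # refresh recency
            return self.fast[key]
        if key in self.slow:
            value = self.slow.pop(key)
            self.put(key, value)  # promote back into the fast tier
            return value
        return None


store = TieredKVStore(fast_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, name.encode())

# "a" was offloaded to the slow tier; reading it promotes it and demotes "b".
value = store.get("a")
```

The point of the design is that an entry pushed out of the fast tier is still a cache hit later, just a slower one, rather than a full recomputation.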

Getting started

Install LMCache with pip, then run it as a KV cache layer in front of a serving engine such as vLLM. The example below uses the recommended multiprocess (MP) mode.

Install LMCache

Install the lmcache package with pip. The docs recommend installing it alongside vLLM in a fresh Python 3.12 virtual environment.

bash
pip install lmcache

Start the LMCache server

Launch the standalone LMCache server, which manages the KV cache independently of the inference engine.

bash
lmcache server \
    --l1-size-gb 20 --eviction-policy LRU --chunk-size 16
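
As a rough sizing aid for the `--l1-size-gb` and `--chunk-size` values above, the Python sketch below estimates how many chunks fit in the cache. Everything beyond the two flag values is an assumption: the model shape (36 layers, 8 KV heads, head dimension 128, 2-byte values) is what we assume for Qwen/Qwen3-8B and should be checked against the model's config, "GB" is read as GiB, chunk size is assumed to be counted in tokens, and metadata overhead is ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def chunks_that_fit(cache_bytes: int, chunk_size_tokens: int, bytes_per_token: int) -> int:
    """Whole chunks that fit in a cache of the given size."""
    return cache_bytes // (chunk_size_tokens * bytes_per_token)


# Assumed model shape (verify against the model's config.json).
per_token = kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128, dtype_bytes=2)

# 20 GiB cache and 16-token chunks, mirroring the flags in the command above.
chunks = chunks_that_fit(20 * 1024**3, 16, per_token)
tokens = chunks * 16

print(per_token)  # 147456 bytes, i.e. 144 KiB per token
print(chunks)     # 9102 chunks
print(tokens)     # 145632 tokens
```

Under these assumptions a 20 GiB tier holds roughly 145k tokens of KV state, which gives a feel for how many long contexts can stay resident before eviction starts.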

Serve a model with the LMCache connector

Start vLLM and point it at LMCache through the MP connector using the kv-transfer-config flag.

bash
vllm serve Qwen/Qwen3-8B \
    --port 8000 --kv-transfer-config \
    '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'
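
If you launch vLLM from a script, building the JSON argument programmatically avoids shell-quoting mistakes. The Python sketch below only assembles the command string; the `build_vllm_command` helper is hypothetical, and the flag names and JSON keys are copied from the command above rather than checked against a particular vLLM release.

```python
import json
import shlex


def build_vllm_command(model: str, port: int, connector: str, role: str) -> str:
    """Return a shell-safe `vllm serve` command line with the KV transfer config."""
    kv_config = json.dumps({"kv_connector": connector, "kv_role": role})
    args = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", kv_config,
    ]
    return shlex.join(args)  # quotes the JSON so the shell passes it as one argument


command = build_vllm_command("Qwen/Qwen3-8B", 8000, "LMCacheMPConnector", "kv_both")
print(command)
```

Splitting the result with `shlex.split` gives back the original argument list, which is a quick way to confirm the JSON survives quoting intact.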

Send a request

Call the OpenAI-compatible completions endpoint; repeated context will hit the cached KV state.

bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-8B", "prompt": "Your prompt", "max_tokens": 100}'
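
The same request can be sent from Python using only the standard library, which makes it easy to time a repeated prompt. This is a sketch under assumptions: the helper names are hypothetical, the server is assumed to be listening on `http://localhost:8000` as in the steps above, and whole-request latency is only a rough proxy for TTFT (measuring TTFT directly would need a streaming response).

```python
import json
import time
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_completion(request: urllib.request.Request, timeout: float = 60.0) -> tuple[float, dict]:
    """Send the request and return (elapsed seconds, decoded JSON response)."""
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return time.perf_counter() - start, payload


request = build_completion_request(
    "http://localhost:8000", "Qwen/Qwen3-8B", "Your prompt", 100
)
```

With the server running, calling `timed_completion(request)` twice with a long shared prompt and comparing the two elapsed times gives a quick, informal check that the second call benefits from cached KV state.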

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

  • Serving long-context, multi-turn chat where earlier conversation turns are re-sent on every request and would otherwise be re-prefilled
  • RAG and knowledge-augmented workloads where the same retrieved documents recur across queries
  • Agentic workloads that repeatedly feed long shared context, tools, or system prompts into the model
  • Disaggregated prefill/decode deployments that need to transfer KV cache between workers over NVLink, RDMA, or TCP
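
To see why multi-turn workloads benefit most, the Python sketch below counts prefill tokens over a conversation with and without KV reuse. It is an idealized model, not a benchmark: the `prefill_tokens` helper is hypothetical, every turn is assumed to add the same number of new tokens, and the cached case assumes every earlier token is a cache hit.

```python
def prefill_tokens(turns: int, tokens_per_turn: int, cached: bool) -> int:
    """Total tokens prefilled across a conversation under a simplified model."""
    if cached:
        # Only the tokens added in each turn need prefill.
        return turns * tokens_per_turn
    # Turn i re-sends all i * tokens_per_turn tokens of history: sum over i = 1..turns.
    return tokens_per_turn * turns * (turns + 1) // 2


without_cache = prefill_tokens(turns=10, tokens_per_turn=500, cached=False)
with_cache = prefill_tokens(turns=10, tokens_per_turn=500, cached=True)

print(without_cache)  # 27500
print(with_cache)     # 5000
```

Without reuse the prefill work grows quadratically with the number of turns, while with full reuse it grows linearly, so the gap widens as conversations get longer.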

How LMCache compares

The table below lists LMCache alongside the other open-source serving and deployment tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Transformers | ★ 162k | Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM | ★ 83.4k | A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang | ★ 29.3k | A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM | ★ 13.9k | NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM | ★ 12.4k | A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server | ★ 10.8k | A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache | ★ 9.4k | A KV cache layer that stores and reuses attention state to cut repeated LLM prefill