NVIDIA Dynamo

Datacenter-scale inference orchestration for vLLM, SGLang, and TensorRT-LLM

Overview

NVIDIA Dynamo is an open-source inference stack for serving large language and multimodal models across many GPUs and nodes. It sits above existing inference engines rather than replacing them: it turns vLLM, SGLang, or TensorRT-LLM into a coordinated multi-node system. The core is written in Rust for performance, with a Python layer for extensibility.

It is built for teams that already run a single engine well but need to scale out. Dynamo adds disaggregated serving (separating the prefill and decode phases onto independently scalable GPU pools), KV-aware routing to avoid recomputing prefill, multi-tier KV caching, and an autoscaler that aims to meet latency targets at lower total cost.

As a high-throughput serving layer, Dynamo targets datacenter and Kubernetes deployments. If you are running a single model on a single GPU, the project notes that your inference engine alone is usually enough; Dynamo earns its keep once coordination across GPUs becomes the bottleneck.

What it does

Disaggregated prefill/decode: separates the two phases into independently scalable GPU pools so each runs on hardware tuned for its workload
KV-aware routing: routes requests by worker load and KV cache overlap to skip redundant prefill computation
KV Block Manager (KVBM): offloads KV cache across GPU, CPU, SSD, and remote storage to extend effective context length
SLA-based Planner: an autoscaler that profiles workloads and right-sizes GPU pools to hit latency targets at lower TCO
Engine-agnostic orchestration: layers over SGLang, TensorRT-LLM, and vLLM rather than replacing them
Fault tolerance: health checks and in-flight request migration so failed workers don't drop user requests
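
To make the KV-aware routing idea concrete, here is a toy sketch of scoring workers by cache overlap and load. It is not Dynamo's router: the `Worker` class, the `pick_worker` function, the scoring formula, and the `load_weight` parameter are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical view of one worker: its load and the token prefixes it has cached."""
    name: str
    active_requests: int
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)


def shared_prefix_len(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Count how many leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(prompt: tuple[int, ...], workers: list[Worker],
                load_weight: float = 2.0) -> Worker:
    """Pick the worker with the best trade-off between cache overlap and load.

    Illustrative score: cached prompt tokens - load_weight * active requests.
    """
    def score(w: Worker) -> float:
        overlap = max(
            (shared_prefix_len(prompt, p) for p in w.cached_prefixes),
            default=0,
        )
        return overlap - load_weight * w.active_requests

    return max(workers, key=score)


# Made-up token IDs: w0 is busier but already holds most of the prompt.
workers = [
    Worker("w0", active_requests=1, cached_prefixes=[(1, 2, 3, 4, 5, 6)]),
    Worker("w1", active_requests=0, cached_prefixes=[(9, 9)]),
]
chosen = pick_worker((1, 2, 3, 4, 5, 6, 7, 8), workers)
```

In this toy setup the busier worker still wins when it holds a long matching prefix, while a prompt with no cached overlap falls through to the idle worker. A production router would track cache state in blocks and use live load metrics rather than fixed numbers.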

Getting started

Install Dynamo with the extra for your chosen backend, then run the OpenAI-compatible frontend alongside a worker. This example uses the SGLang backend with a small model.

Install Dynamo

Install the package with the backend extra you want. Use [sglang] or [vllm]; for TensorRT-LLM follow the repo's pip instructions with the NVIDIA extra index.

bash

uv pip install --prerelease=allow "ai-dynamo[sglang]"

Start the frontend and a worker

Run the HTTP frontend, then start a backend worker pointing at a model. Each command is a long-running process, so run them in separate terminals. Both use the file discovery backend for a local single-host run.

bash

python3 -m dynamo.frontend --http-port 8000 --discovery-backend file
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file

Send a request

Query the OpenAI-compatible chat completions endpoint to confirm the model is serving.

bash

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 100
}'
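
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the frontend is listening on localhost:8000 as in the commands above and that the response follows the usual OpenAI chat completions shape; the helper names `build_chat_request` and `ask` are made up for this example.

```python
import json
import urllib.request

# Assumption: the frontend started earlier is reachable at this address.
URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "Qwen/Qwen3-0.6B",
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the first choice's text.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the frontend and worker running, `print(ask("Hello!"))` should print the model's reply. If the worker has not finished loading the model, the call may fail or time out.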

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Serving a large LLM across multiple GPUs or nodes that need to be coordinated as one system
Running disaggregated serving to scale prefill and decode independently for better GPU utilization
Using KV-aware routing to cut time to first token on workloads with overlapping prompts
Autoscaling GPU pools on Kubernetes to meet latency SLAs while controlling total cost of ownership
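
As a rough illustration of sizing prefill and decode pools independently against a latency target, the arithmetic below computes a replica count from a profiled per-replica rate. It is a back-of-the-envelope sketch, not the Planner's algorithm: the function name, the `headroom` parameter, and all the numbers are hypothetical.

```python
import math


def replicas_needed(request_rate: float, per_replica_rate: float,
                    headroom: float = 0.25) -> int:
    """Smallest replica count that serves request_rate with spare headroom.

    per_replica_rate is the throughput one replica sustained while still
    meeting the latency target during profiling (a measured input).
    """
    if per_replica_rate <= 0:
        raise ValueError("per_replica_rate must be positive")
    return max(1, math.ceil(request_rate * (1 + headroom) / per_replica_rate))


# Hypothetical profiling results: decode replicas sustain fewer requests
# per second than prefill replicas, so the two pools end up different sizes.
prefill_replicas = replicas_needed(request_rate=40, per_replica_rate=12)
decode_replicas = replicas_needed(request_rate=40, per_replica_rate=5)
```

With these made-up inputs the decode pool comes out twice the size of the prefill pool, which is the kind of imbalance that disaggregated serving lets you provision for directly instead of scaling both phases together.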

How NVIDIA Dynamo compares

NVIDIA Dynamo alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
NVIDIA Dynamo	★ 7.3k	Datacenter-scale inference orchestration for vLLM, SGLang, and TensorRT-LLM
