Overview
NVIDIA Dynamo is an open-source inference stack for serving large language and multimodal models across many GPUs and nodes. It sits above existing inference engines rather than replacing them: it turns vLLM, SGLang, or TensorRT-LLM into a coordinated multi-node system. The core is written in Rust for performance, with a Python layer for extensibility.
It is built for teams that already run a single engine well but need to scale out. Dynamo adds disaggregated serving (separating the prefill and decode phases onto independently scalable GPU pools), KV-aware routing to avoid recomputing prefill, multi-tier KV caching, and an autoscaler that aims to meet latency targets at lower total cost.
As a high-throughput serving layer, Dynamo targets datacenter and Kubernetes deployments. If you are running a single model on a single GPU, the project notes that your inference engine alone is usually enough; Dynamo earns its keep once coordination across GPUs becomes the bottleneck.
What it does
- Disaggregated prefill/decode: separates the two phases into independently scalable GPU pools so each runs on hardware tuned for its workload
- KV-aware routing: routes requests by worker load and KV cache overlap to skip redundant prefill computation
- KV Block Manager (KVBM): offloads KV cache across GPU, CPU, SSD, and remote storage to extend effective context length
- SLA-based Planner: an autoscaler that profiles workloads and right-sizes GPU pools to hit latency targets at lower total cost of ownership (TCO)
- Works as an orchestration layer over SGLang, TensorRT-LLM, and vLLM rather than replacing them
- Fault tolerance with health checks and in-flight request migration so failed workers don't drop user requests
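To make the KV-aware routing idea concrete, here is a toy sketch of scoring workers by cached-prefix overlap and current load. It is not Dynamo's router: the `Worker` class, the `pick_worker` function, the block size, and the cost formula are all hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block; hypothetical value for illustration


def to_blocks(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[tuple[int, ...]]:
    """Split a token sequence into full fixed-size blocks (a partial tail is dropped)."""
    full = len(tokens) - len(tokens) % block_size
    return [tuple(tokens[i:i + block_size]) for i in range(0, full, block_size)]


@dataclass
class Worker:
    """Hypothetical worker record: a name, a load figure, and cached block prefixes."""
    name: str
    active_requests: int = 0
    cached_prefixes: list[list[tuple[int, ...]]] = field(default_factory=list)

    def overlap(self, blocks: list[tuple[int, ...]]) -> int:
        """Length of the longest run of leading blocks matching any cached prefix."""
        best = 0
        for cached in self.cached_prefixes:
            matched = 0
            for have, want in zip(cached, blocks):
                if have != want:
                    break
                matched += 1
            best = max(best, matched)
        return best


def pick_worker(workers: list[Worker], tokens: list[int], load_weight: float = 1.0) -> Worker:
    """Pick the worker with the lowest assumed cost: blocks to prefill plus weighted load."""
    blocks = to_blocks(tokens)

    def cost(worker: Worker) -> float:
        blocks_to_prefill = len(blocks) - worker.overlap(blocks)
        return blocks_to_prefill + load_weight * worker.active_requests

    return min(workers, key=cost)


# A 16-token shared prompt prefix (4 blocks) is already cached on worker "a".
shared = list(range(16))
a = Worker("a", active_requests=2, cached_prefixes=[to_blocks(shared)])
b = Worker("b", active_requests=0)
request = shared + [100, 101, 102, 103]  # 5 blocks, 4 of them cached on "a"
chosen = pick_worker([a, b], request)
```

In this example worker "a" wins despite carrying more load, because it only needs to prefill one new block. Raising `active_requests` on "a" far enough flips the choice to "b", which is the load-versus-overlap trade-off the bullet describes.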
Getting started
Install Dynamo with the extra for your chosen backend, then run the OpenAI-compatible frontend alongside a worker. This example uses the SGLang backend with a small model.
Install Dynamo
Install the package with the backend extra you want. Use [sglang] or [vllm]; for TensorRT-LLM follow the repo's pip instructions with the NVIDIA extra index.
uv pip install --prerelease=allow "ai-dynamo[sglang]"
Start the frontend and a worker
Run the HTTP frontend, then start a backend worker in a second terminal, pointing it at a model. Both use the file discovery backend for a local single-host run.
python3 -m dynamo.frontend --http-port 8000 --discovery-backend file
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
Send a request
Query the OpenAI-compatible chat completions endpoint to confirm the model is serving.
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 100
}'
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
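The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the frontend started above is listening on `localhost:8000` and returns the usual OpenAI-style response shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are made up for this example.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "Qwen/Qwen3-0.6B",
    base_url: str = "http://localhost:8000",  # assumed local frontend address
    max_tokens: int = 100,
) -> urllib.request.Request:
    """Build a POST request mirroring the curl command's JSON body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the first choice's text.

    Assumes an OpenAI-style chat completions response body.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the frontend and worker running, `print(chat("Hello!"))` should print the model's reply. Keeping request construction separate from the network call makes the payload easy to inspect without a live server.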
When to use it
- Serving a large LLM across multiple GPUs or nodes that need to be coordinated as one system
- Running disaggregated serving to scale prefill and decode independently for better GPU utilization
- Using KV-aware routing to cut time to first token on workloads with overlapping prompts
- Autoscaling GPU pools on Kubernetes to meet latency SLAs while controlling total cost of ownership
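As a rough illustration of why scaling prefill and decode independently can help utilization, the sketch below sizes each pool from its own token demand. It is back-of-the-envelope throughput arithmetic, not Dynamo's Planner, which profiles workloads against latency targets. Every name and number here (`size_pools`, the per-worker throughput figures, the `headroom` factor) is a hypothetical assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolPlan:
    """Hypothetical result type: worker counts for each pool."""
    prefill_workers: int
    decode_workers: int


def size_pools(
    requests_per_s: float,
    prompt_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float,  # assumed per-worker prefill throughput
    decode_tokens_per_s: float,   # assumed per-worker decode throughput
    headroom: float = 0.8,        # assumed target utilization per worker
) -> PoolPlan:
    """Size each pool from its own demand instead of scaling both together."""
    prefill_demand = requests_per_s * prompt_tokens  # prompt tokens per second
    decode_demand = requests_per_s * output_tokens   # generated tokens per second
    return PoolPlan(
        prefill_workers=max(1, math.ceil(prefill_demand / (prefill_tokens_per_s * headroom))),
        decode_workers=max(1, math.ceil(decode_demand / (decode_tokens_per_s * headroom))),
    )


# Long prompts with short answers: prefill demand dominates.
plan = size_pools(
    requests_per_s=10,
    prompt_tokens=3000,
    output_tokens=150,
    prefill_tokens_per_s=10_000,
    decode_tokens_per_s=1_000,
)
print(plan)  # PoolPlan(prefill_workers=4, decode_workers=2)
```

Under these made-up figures the prefill pool needs twice as many workers as the decode pool. A deployment that scales both phases in lockstep would have to provision for the larger of the two, leaving the other side underused.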
How NVIDIA Dynamo compares
NVIDIA Dynamo alongside the other open-source serving and deployment tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Transformers | ★ 162k | Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training. |
| vLLM | ★ 83.4k | A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once. |
| SGLang | ★ 29.3k | A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests. |
| TensorRT-LLM | ★ 13.9k | NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs. |
| OpenLLM | ★ 12.4k | A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud. |
| NVIDIA Triton Inference Server | ★ 10.8k | A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution. |
| OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware. |
| NVIDIA Dynamo | ★ 7.3k | Datacenter-scale inference orchestration for vLLM, SGLang, and TensorRT-LLM. |