AI/TLDR

NVIDIA Dynamo

Datacenter-scale inference orchestration for vLLM, SGLang, and TensorRT-LLM

Overview

NVIDIA Dynamo is an open-source inference stack for serving large language and multimodal models across many GPUs and nodes. It sits above existing inference engines rather than replacing them: it turns vLLM, SGLang, or TensorRT-LLM into a coordinated multi-node system. The core is written in Rust for performance, with a Python layer for extensibility.

It is built for teams that already run a single engine well but need to scale out. Dynamo adds disaggregated serving (separating the prefill and decode phases onto independently scalable GPU pools), KV-aware routing to avoid recomputing prefill, multi-tier KV caching, and an autoscaler that aims to meet latency targets at lower total cost.

As a high-throughput serving layer, Dynamo targets datacenter and Kubernetes deployments. If you are running a single model on a single GPU, the project notes that your inference engine alone is usually enough; Dynamo earns its keep once coordination across GPUs becomes the bottleneck.
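
To make the idea of independently scalable prefill and decode pools concrete, here is a minimal sizing sketch. It is not Dynamo's Planner: the class names, the `size_pools` function, and every throughput figure are hypothetical, and it assumes per-GPU throughput has already been profiled at the latency target.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    requests_per_s: float
    avg_input_tokens: int    # prompt tokens, processed during prefill
    avg_output_tokens: int   # generated tokens, produced during decode


@dataclass
class Profile:
    # Hypothetical per-GPU throughput measured while still meeting the
    # latency target for each phase.
    prefill_tokens_per_s_per_gpu: float
    decode_tokens_per_s_per_gpu: float


def size_pools(workload: Workload, profile: Profile) -> dict:
    """Size each pool from its own token demand, independently of the other."""
    prefill_demand = workload.requests_per_s * workload.avg_input_tokens
    decode_demand = workload.requests_per_s * workload.avg_output_tokens
    return {
        "prefill_gpus": max(1, math.ceil(prefill_demand / profile.prefill_tokens_per_s_per_gpu)),
        "decode_gpus": max(1, math.ceil(decode_demand / profile.decode_tokens_per_s_per_gpu)),
    }


# Illustrative numbers only, not measurements.
profile = Profile(prefill_tokens_per_s_per_gpu=8000, decode_tokens_per_s_per_gpu=500)
long_prompts = size_pools(Workload(10, 2000, 200), profile)
long_outputs = size_pools(Workload(10, 200, 2000), profile)
```

With these made-up figures, a prompt-heavy workload needs more prefill GPUs and an output-heavy one needs far more decode GPUs, which is the imbalance that a single shared pool cannot adapt to.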

What it does

  • Disaggregated prefill/decode: separates the two phases into independently scalable GPU pools so each runs on hardware tuned for its workload
  • KV-aware routing: routes requests by worker load and KV cache overlap to skip redundant prefill computation
  • KV Block Manager (KVBM): offloads KV cache across GPU, CPU, SSD, and remote storage to extend effective context length
  • SLA-based Planner: an autoscaler that profiles workloads and right-sizes GPU pools to hit latency targets at lower TCO
  • Works as an orchestration layer over SGLang, TensorRT-LLM, and vLLM rather than replacing them
  • Fault tolerance with health checks and in-flight request migration so failed workers don't drop user requests
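
The KV-aware routing item above can be illustrated with a toy scoring function. This is a sketch of the general technique (trade cached-prefix overlap against worker load), not Dynamo's router: the `Worker` class, the cost formula, the weight, and the tiny block size are all assumptions made for the example.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4  # tokens per KV block; deliberately tiny for the example


@dataclass
class Worker:
    name: str
    active_blocks: int                 # KV blocks held by in-flight requests (load)
    cached_prefixes: list[list[int]]   # token sequences whose KV blocks are cached


def overlap_blocks(tokens: list[int], cached: list[int]) -> int:
    """Number of whole leading blocks of `tokens` already covered by `cached`."""
    matched = 0
    for a, b in zip(tokens, cached):
        if a != b:
            break
        matched += 1
    return matched // BLOCK_SIZE


def pick_worker(tokens: list[int], workers: list[Worker], overlap_weight: float = 1.0) -> Worker:
    """Pick the worker with the lowest (new prefill work + current load) cost."""
    total_blocks = -(-len(tokens) // BLOCK_SIZE)  # ceiling division

    def cost(worker: Worker) -> float:
        best = max((overlap_blocks(tokens, c) for c in worker.cached_prefixes), default=0)
        new_prefill_blocks = total_blocks - best
        return overlap_weight * new_prefill_blocks + worker.active_blocks

    return min(workers, key=cost)


prompt = list(range(12))                                  # 3 blocks of 4 tokens
warm = Worker("warm", active_blocks=1, cached_prefixes=[list(range(8))])
cold = Worker("cold", active_blocks=0, cached_prefixes=[])
busy = Worker("busy", active_blocks=5, cached_prefixes=[list(range(8))])
```

Here `warm` wins over `cold` because two of the three prompt blocks are already cached there, while `busy` loses to `cold` despite the same cache hit because its load outweighs the prefill it would save.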

Getting started

Install Dynamo with the extra for your chosen backend, then run the OpenAI-compatible frontend alongside a worker. This example uses the SGLang backend with a small model.

Install Dynamo

Install the package with the backend extra you want. Use [sglang] or [vllm]; for TensorRT-LLM follow the repo's pip instructions with the NVIDIA extra index.

bash
uv pip install --prerelease=allow "ai-dynamo[sglang]"

Start the frontend and a worker

Run the HTTP frontend in one terminal, then start a backend worker pointing at a model in a second terminal, since each command keeps running in the foreground. Both use the file discovery backend for a local single-host run.

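bash
# Terminal 1: OpenAI-compatible HTTP frontend on port 8000
python3 -m dynamo.frontend --http-port 8000 --discovery-backend file
# Terminal 2: SGLang worker serving the model
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file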
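
The worker can take a while to load the model, so a script may want to wait before sending traffic. The sketch below polls for readiness; it assumes the frontend exposes the OpenAI-style `/v1/models` listing with a `data` array of objects carrying an `id` field, which should be checked against the official docs. `list_models` and `wait_for_model` are hypothetical helper names.

```python
import json
import time
import urllib.request


def list_models(base_url: str) -> list:
    """Return model ids from an OpenAI-style /v1/models listing (assumed endpoint)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return [m["id"] for m in json.load(resp).get("data", [])]


def wait_for_model(model, base_url="http://localhost:8000", timeout_s=120.0,
                   interval_s=2.0, fetch=list_models, sleep=time.sleep) -> bool:
    """Poll until `model` is listed or `timeout_s` elapses; True means ready."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if model in fetch(base_url):
                return True
        except (OSError, ValueError):
            # Frontend not up yet, connection refused, or a non-JSON reply.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_for_model("Qwen/Qwen3-0.6B")` after starting both processes would block until the model is listed or two minutes pass. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.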

Send a request

Query the OpenAI-compatible chat completions endpoint to confirm the model is serving.

bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 100
}'
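
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names are hypothetical, and the reply parsing assumes the response follows the OpenAI chat completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen3-0.6B",
                       base_url="http://localhost:8000", max_tokens=100):
    """Build the POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response (assumed shape)."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With the frontend and worker running, `chat("Hello!")` would return the model's reply as a string. Because the endpoint is OpenAI-compatible, an existing OpenAI client library pointed at `http://localhost:8000/v1` should also work.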

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Serving a large LLM across multiple GPUs or nodes that need to be coordinated as one system
  • Running disaggregated serving to scale prefill and decode independently for better GPU utilization
  • Using KV-aware routing to cut time to first token on workloads with overlapping prompts
  • Autoscaling GPU pools on Kubernetes to meet latency SLAs while controlling total cost of ownership

How NVIDIA Dynamo compares

NVIDIA Dynamo alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
Transformers | ★ 162k | Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM | ★ 83.4k | A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang | ★ 29.3k | A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM | ★ 13.9k | NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM | ★ 12.4k | A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server | ★ 10.8k | A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
NVIDIA Dynamo | ★ 7.3k | Datacenter-scale inference orchestration for vLLM, SGLang, and TensorRT-LLM