NVIDIA Triton Inference Server

Open-source server for deploying AI models from any framework on GPU or CPU

github.com/triton-inference-server/server (★ 10.8k) · developer.nvidia.com/triton-inference-server

Overview

NVIDIA Triton Inference Server is open-source inference serving software that runs AI models from many frameworks behind one server. It supports TensorRT, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more, so you can serve mixed model types without standing up a separate service for each.

It is built for teams that need to put trained models into production. Triton runs across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia, and handles real-time, batched, ensemble, and audio/video streaming requests.

As a model-serving tool, Triton sits between your trained models and the applications that call them. It exposes HTTP/REST and gRPC endpoints based on the KServe protocol, plus C and Java APIs for in-process use, and reports metrics on GPU utilization, throughput, and latency.
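As a rough illustration of those endpoints, the sketch below builds the KServe v2 URLs a client would call. It assumes Triton's default ports on localhost (8000 for HTTP, 8002 for metrics). The TRITON_HTTP and TRITON_METRICS variables and the helper function names are illustrative and not part of Triton.

```bash
# Assumed defaults: Triton's HTTP endpoint on port 8000, metrics on port 8002.
TRITON_HTTP="${TRITON_HTTP:-http://localhost:8000}"
TRITON_METRICS="${TRITON_METRICS:-http://localhost:8002}"

# Hypothetical helpers that only build URLs; they make no network calls.
ready_url()   { printf '%s/v2/health/ready\n' "$TRITON_HTTP"; }
model_url()   { printf '%s/v2/models/%s\n' "$TRITON_HTTP" "$1"; }
infer_url()   { printf '%s/v2/models/%s/infer\n' "$TRITON_HTTP" "$1"; }
metrics_url() { printf '%s/metrics\n' "$TRITON_METRICS"; }

# With a server running, the URLs can be passed to curl, for example:
#   curl -sf "$(ready_url)"                 # readiness probe
#   curl -s  "$(model_url densenet_onnx)"   # model metadata (inputs, outputs)
#   curl -s  "$(metrics_url)"               # Prometheus-format metrics
```

Fetching the model metadata first is a convenient way to learn the input and output tensor names before building a request for the infer URL.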

What it does

Serves models from multiple frameworks: TensorRT, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and others
Concurrent model execution and dynamic batching to raise throughput
Sequence batching and implicit state management for stateful models
HTTP/REST and gRPC inference protocols based on the community KServe v2 protocol
Backend API for custom backends and pre/post-processing, including Python-based backends
Model pipelines via Ensembling or Business Logic Scripting (BLS)
Metrics for GPU utilization, server throughput, and server latency
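
Dynamic batching and the number of concurrent model instances are set per model in its config.pbtxt. The sketch below writes a hypothetical configuration for a model named my_model into a scratch directory. The model name, batch sizes, queue delay, and instance count are illustrative values, not recommendations, and the fragment omits input and output definitions.

```bash
# Scratch model repository so nothing in a real repository is overwritten.
REPO="$(mktemp -d)"
mkdir -p "$REPO/my_model/1"   # "my_model" is a hypothetical model name

# Write an illustrative config.pbtxt; all numeric values are assumptions.
cat > "$REPO/my_model/config.pbtxt" <<'EOF'
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8

# Let Triton combine individual requests into larger batches.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

# Run two instances of the model on each available GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
EOF

echo "wrote $REPO/my_model/config.pbtxt"
```

Larger queue delays give the batcher more time to fill a batch at the cost of per-request latency, so the right values depend on the workload.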

Getting started

The recommended way to run Triton is with the prebuilt NGC Docker container. The README shows a three-step example that serves a sample ONNX model and sends an inference request.

Create the example model repository

Clone the release branch and run the helper script that downloads the example models, including densenet_onnx.

git clone -b r26.05 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh
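
Triton expects each model in its own directory with at least one numbered version subdirectory. A minimal sketch for checking that layout after the helper script finishes is below. The function name is hypothetical, and the check covers only the directory structure, not the model files themselves.

```bash
# Hypothetical helper: succeed if a model directory contains at least one
# numeric version subdirectory (for example densenet_onnx/1).
has_model_version() {
  local model_dir="$1" entry
  for entry in "$model_dir"/*/; do
    [ -d "$entry" ] || continue      # unmatched glob or not a directory
    case "$(basename "$entry")" in
      ''|*[!0-9]*) ;;                # name is not purely numeric, skip it
      *) return 0 ;;                 # found a version directory
    esac
  done
  return 1
}

# Usage from server/docs/examples after running ./fetch_models.sh:
#   has_model_version model_repository/densenet_onnx && echo "layout looks OK"
```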

Launch Triton from the NGC container

Run the Triton container with your model repository mounted, loading the densenet_onnx model explicitly.

docker run --gpus=1 --rm --net=host \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:26.05-py3 \
  tritonserver --model-repository=/models \
  --model-control-mode explicit --load-model densenet_onnx
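
Model loading can take a while, so scripts often wait for the server before sending requests. The sketch below polls a probe command with a bounded number of attempts. The example usage assumes the default HTTP port 8000 and the KServe v2 readiness path /v2/health/ready. The function name and the retry values are illustrative.

```bash
# Hypothetical helper: run a probe command until it succeeds or attempts run out.
# usage: wait_until_ready <max_attempts> <delay_seconds> <probe command...>
wait_until_ready() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while ! "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1                       # gave up: the probe never succeeded
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Example against a local server (assumed default port 8000):
#   wait_until_ready 30 2 curl -sf http://localhost:8000/v2/health/ready \
#     && echo "Triton is ready"
```

Passing the probe as arguments keeps the retry logic separate from the check itself, so the same helper works for a per-model readiness URL.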

Send an inference request

In a separate console, use the image_client example from the Triton SDK container to classify a sample image. The -c 3 flag requests the top three classes, which should be COFFEE MUG, CUP, and COFFEEPOT.

docker run -it --rm --net=host \
  nvcr.io/nvidia/tritonserver:26.05-py3-sdk \
  /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION \
  /workspace/images/mug.jpg

Read the QuickStart guide

For CPU-only systems and more details, follow the QuickStart guide in the repo at docs/getting_started/quickstart.md.
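
On a CPU-only host, the main change to the launch step is dropping the --gpus flag from docker run. One way to keep a single script for both cases is sketched below. The function name, the DOCKER_ARGS array, and the mode argument are hypothetical, and the sketch leaves out the explicit model-loading flags for brevity.

```bash
# Hypothetical helper: fill the global array DOCKER_ARGS for a Triton launch.
# Mode "gpu" adds --gpus=1; any other mode omits it for CPU-only hosts.
build_triton_args() {
  local mode="$1" repo="$2" image="$3"
  DOCKER_ARGS=(run --rm --net=host -v "$repo:/models")
  if [ "$mode" = "gpu" ]; then
    DOCKER_ARGS+=(--gpus=1)
  fi
  DOCKER_ARGS+=("$image" tritonserver --model-repository=/models)
}

# Example for a CPU-only host, reusing the image tag from the launch step:
#   build_triton_args cpu "${PWD}/model_repository" nvcr.io/nvidia/tritonserver:26.05-py3
#   docker "${DOCKER_ARGS[@]}"
```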

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Serving a mix of TensorRT, PyTorch, and ONNX models from a single production endpoint
Maximizing GPU throughput for real-time or batched inference with dynamic batching and concurrent execution
Running multi-step model pipelines with Ensembling or Business Logic Scripting
Embedding inference directly into an application at the edge using the C or Java API

How NVIDIA Triton Inference Server compares

NVIDIA Triton Inference Server compared with the other open-source serving and deployment tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	Open-source server for deploying AI models from any framework on GPU or CPU
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache	★ 9.4k	A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation.