AI/TLDR

NVIDIA Triton Inference Server

Open-source server for deploying AI models from any framework on GPU or CPU

Overview

NVIDIA Triton Inference Server is open-source inference serving software that runs AI models from many frameworks behind one server. It supports TensorRT, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more, so you can serve mixed model types without standing up a separate service for each.

It is built for teams that need to put trained models into production. Triton runs across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia, and handles real-time, batched, ensemble, and audio/video streaming requests.

As a model-serving tool, Triton sits between your trained models and the applications that call them. It exposes HTTP/REST and gRPC endpoints based on the KServe protocol, plus C and Java APIs for in-process use, and reports metrics on GPU utilization, throughput, and latency.
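As a rough illustration of the HTTP/REST side, the sketch below builds a KServe v2 style inference request using only the Python standard library. The base URL (Triton's default HTTP port is 8000), the model name `my_model`, the tensor names `INPUT0` and `OUTPUT0`, and the helper names `infer_url`, `build_infer_body`, and `post_infer` are all assumptions for illustration; a real model's metadata defines its actual tensor names, shapes, and datatypes.

```python
import json
import urllib.request


def infer_url(model_name, base_url="http://localhost:8000", version=None):
    """Return the KServe v2 inference URL for a model (optionally one version)."""
    path = f"/v2/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    return base_url.rstrip("/") + path + "/infer"


def build_infer_body(input_name, shape, datatype, data, output_names=()):
    """Build the JSON-serializable request body for a single input tensor."""
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),  # flattened, row-major values
            }
        ]
    }
    if output_names:
        body["outputs"] = [{"name": name} for name in output_names]
    return body


def post_infer(url, body, timeout_s=10.0):
    """POST the request and return the decoded JSON reply (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


# Hypothetical model with one FP32 input of shape [1, 4].
url = infer_url("my_model")
body = build_infer_body("INPUT0", [1, 4], "FP32", [0.1, 0.2, 0.3, 0.4], ["OUTPUT0"])
print(url)
print(json.dumps(body))
```

Only `post_infer(url, body)` touches the network, so the request can be inspected before anything is sent.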

What it does

  • Serves models from multiple frameworks: TensorRT, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and others
  • Concurrent model execution and dynamic batching to raise throughput
  • Sequence batching and implicit state management for stateful models
  • HTTP/REST and gRPC inference protocols based on the community KServe v2 protocol
  • Backend API for custom backends and pre/post-processing, including Python-based backends
  • Model pipelines via Ensembling or Business Logic Scripting (BLS)
  • Metrics for GPU utilization, throughput, and latency
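To make the dynamic batching idea concrete, here is a toy model of it in Python. It is not Triton's scheduler: the function name `form_batches` and its parameters are hypothetical, and the sketch assumes a model instance is always free and that arrival times are sorted. It only shows how a maximum batch size and a maximum queue delay together decide which requests get grouped.

```python
def form_batches(arrival_us, max_batch_size, max_queue_delay_us):
    """Group request indices into batches.

    A batch opens when its first request arrives. It is dispatched once it is
    full, or once a later request arrives after the queue delay has expired.
    arrival_us must be sorted in ascending order (microseconds).
    """
    batches = []
    current = []
    deadline = 0
    for index, arrival in enumerate(arrival_us):
        batch_full = len(current) == max_batch_size
        if current and (batch_full or arrival > deadline):
            batches.append(current)
            current = []
        if not current:
            # The first request of a batch starts the delay window.
            deadline = arrival + max_queue_delay_us
        current.append(index)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 10, 20, 500, 510]  # made-up arrival times in microseconds
print(form_batches(arrivals, max_batch_size=4, max_queue_delay_us=100))
# [[0, 1, 2], [3, 4]]
print(form_batches(arrivals, max_batch_size=2, max_queue_delay_us=100))
# [[0, 1], [2], [3, 4]]
```

A longer delay window tends to produce larger batches and higher throughput at the cost of added latency for the first request in each batch.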

Getting started

The recommended way to run Triton is with the prebuilt NGC Docker container. The README shows a three-step example that serves a sample ONNX model and sends an inference request.

Create the example model repository

Clone the release branch and run the helper script that downloads the example models, including densenet_onnx.

```bash
git clone -b r26.05 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh
```
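Triton expects each model in its own directory inside the repository, with numbered version subdirectories and, usually, a `config.pbtxt` next to them. The sketch below is one way to sanity-check that layout before launching the server; the helper name `list_models` is hypothetical, and the demo runs against a throwaway directory rather than the real download.

```python
import tempfile
from pathlib import Path


def list_models(repository):
    """Map each model directory to its sorted numeric version subdirectories."""
    models = {}
    for model_dir in sorted(Path(repository).iterdir()):
        if not model_dir.is_dir():
            continue
        versions = sorted(
            int(child.name)
            for child in model_dir.iterdir()
            if child.is_dir() and child.name.isdecimal()
        )
        models[model_dir.name] = versions
    return models


# Demo on a temporary directory that mimics the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "densenet_onnx" / "1").mkdir(parents=True)
    (Path(tmp) / "densenet_onnx" / "config.pbtxt").touch()
    demo = list_models(tmp)
print(demo)
```

Run from `server/docs/examples` after the fetch script finishes, `list_models("model_repository")` should include `densenet_onnx` among its keys. A model with an empty version list is a hint that the download did not complete.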

Launch Triton from the NGC container

Run the Triton container with your model repository mounted, loading the densenet_onnx model explicitly.

```bash
docker run --gpus=1 --rm --net=host \
  -v "${PWD}/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:26.05-py3 \
  tritonserver --model-repository=/models \
  --model-control-mode explicit --load-model densenet_onnx
```
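The server takes a moment to load the model, so it can help to wait for readiness before sending requests. The sketch below polls the KServe v2 readiness endpoint, `/v2/health/ready`, with the Python standard library. The base URL assumes Triton's default HTTP port (8000) on the local host, and the helper names `http_probe` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if a GET to the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(
    base_url="http://localhost:8000",
    attempts=30,
    delay_s=1.0,
    probe=http_probe,
    sleep=time.sleep,
):
    """Poll the readiness endpoint until it succeeds or attempts run out."""
    url = base_url.rstrip("/") + "/v2/health/ready"
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay_s)
    return False
```

Calling `wait_until_ready()` returns `True` once the server reports ready and `False` if it never does within the allotted attempts. The `probe` and `sleep` parameters exist so the loop can be exercised without a live server.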

Send an inference request

In a separate console, use the image_client example from the Triton SDK container to classify a sample image. The -c 3 flag requests the top three classes, which for this image should be COFFEE MUG, CUP, and COFFEEPOT.

```bash
docker run -it --rm --net=host \
  nvcr.io/nvidia/tritonserver:26.05-py3-sdk \
  /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION \
  /workspace/images/mug.jpg
```
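The two flags in that command correspond to a preprocessing step and a postprocessing step. Assuming that `-s INCEPTION` scales 8-bit pixel values into the range [-1, 1] and that `-c 3` keeps the three highest-scoring classes, the plain-Python sketch below shows both ideas in isolation. The function names are hypothetical, the scores are placeholders rather than real model output, and the actual client works on full image tensors.

```python
def inception_scale(pixels):
    """Map 8-bit pixel values in [0, 255] to floats in [-1.0, 1.0]."""
    return [value / 127.5 - 1.0 for value in pixels]


def top_k(scores, labels, k=3):
    """Return the k (score, label) pairs with the highest scores, best first."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


print(inception_scale([0, 255]))
# Placeholder scores and labels, not real model output.
print(top_k([0.1, 0.7, 0.2, 0.9], ["a", "b", "c", "d"], k=3))
```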

Read the QuickStart guide

For CPU-only systems and more details, follow the QuickStart guide in the repo at docs/getting_started/quickstart.md.

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Serving a mix of TensorRT, PyTorch, and ONNX models from a single production endpoint
  • Maximizing GPU throughput for real-time or batched inference with dynamic batching and concurrent execution
  • Running multi-step model pipelines with Ensembling or Business Logic Scripting
  • Embedding inference directly into an application at the edge using the C or Java API

How NVIDIA Triton Inference Server compares

The table below places NVIDIA Triton Inference Server alongside the other open-source serving and deployment tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Transformers | ★ 162k | Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training. |
| vLLM | ★ 83.4k | A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once. |
| SGLang | ★ 29.3k | A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests. |
| TensorRT-LLM | ★ 13.9k | NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs. |
| OpenLLM | ★ 12.4k | A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud. |
| NVIDIA Triton Inference Server | ★ 10.8k | Open-source server for deploying AI models from any framework on GPU or CPU. |
| OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware. |
| LMCache | ★ 9.4k | A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation. |