TensorRT-LLM

Compile LLMs into optimized engines for fast inference on NVIDIA GPUs

github.com/NVIDIA/TensorRT-LLM★ 13.9k nvidia.github.io/TensorRT-LLM

Overview

TensorRT-LLM is NVIDIA's open-source library for running large language models and visual generation models on NVIDIA GPUs. It optimizes inference with specialized kernels for common operations, an efficient runtime, and a Python framework you can use to customize and extend the system.

It is built for teams that serve models on NVIDIA data-center GPUs and want lower latency and higher throughput than a generic runtime gives. You write against a Python LLM API, and TensorRT-LLM handles the work of turning the model into an optimized engine for the target hardware.

Within the inference and serving space, it sits in the high-throughput serving layer. You can call it directly from Python for batch generation, or start an OpenAI-compatible HTTP server with the bundled trtllm-serve command to put a model behind an endpoint.

What it does

Specialized GPU kernels for common LLM operations, plus an efficient inference runtime
Python LLM API for generation with configurable sampling parameters
trtllm-serve command that exposes a model over an HTTP endpoint
Support for recent open-weight models, including day-0 support for several new releases
Support for visual generation (diffusion) models in addition to text LLMs
Apache-2.0 licensed and developed in the open on GitHub

Getting started

Install the prerequisites and the tensorrt_llm wheel, then run a model through the Python LLM API or the trtllm-serve command. A recent NVIDIA GPU with a matching CUDA Toolkit is required.

Install system prerequisites

On Linux, install OpenMPI before the pip packages.

bashbash

sudo apt-get -y install libopenmpi-dev

Install PyTorch and TensorRT-LLM

Install a matching PyTorch build, then the tensorrt_llm wheel. The wheel requires a compatible CUDA Toolkit with CUDA_HOME configured.

bashbash

pip3 install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip3 install --ignore-installed pip setuptools wheel && pip3 install tensorrt_llm

Generate text with the Python API

Load a model with the LLM class and call generate with your sampling parameters.

pythonpython

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

Serve a model over HTTP

Use trtllm-serve to put a model behind an endpoint.

bashbash

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Serving an open-weight LLM behind an HTTP endpoint on NVIDIA data-center GPUs
Lowering latency and raising throughput for a model already running on a generic runtime
Running batch text generation from Python with custom sampling settings
Serving diffusion-based visual generation models alongside text LLMs on the same stack

How TensorRT-LLM compares

TensorRT-LLM alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	Compile LLMs into optimized engines for fast inference on NVIDIA GPUs
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache	★ 9.4k	A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation.

// Overview

// What it does

// Getting started

Install system prerequisites

Install PyTorch and TensorRT-LLM

Generate text with the Python API

Serve a model over HTTP

// When to use it

// How TensorRT-LLM compares

Overview

What it does

Getting started

When to use it

How TensorRT-LLM compares