AI/TLDR

TensorRT-LLM

Compile LLMs into optimized engines for fast inference on NVIDIA GPUs

Overview

TensorRT-LLM is NVIDIA's open-source library for running large language models and visual generation models on NVIDIA GPUs. It optimizes inference with specialized kernels for common operations, an efficient runtime, and a Python framework you can use to customize and extend the system.

It is built for teams that serve models on NVIDIA data-center GPUs and want lower latency and higher throughput than a generic runtime gives. You write against a Python LLM API, and TensorRT-LLM handles the work of turning the model into an optimized engine for the target hardware.

Within the inference and serving space, it sits in the high-throughput serving layer. You can call it directly from Python for batch generation, or start an OpenAI-compatible HTTP server with the bundled trtllm-serve command to put a model behind an endpoint.

What it does

  • Specialized GPU kernels for common LLM operations, plus an efficient inference runtime
  • Python LLM API for generation with configurable sampling parameters
  • trtllm-serve command that exposes a model over an HTTP endpoint
  • Support for recent open-weight models, including day-0 support for several new releases
  • Support for visual generation (diffusion) models in addition to text LLMs
  • Apache-2.0 licensed and developed in the open on GitHub

Getting started

Install the prerequisites and the tensorrt_llm wheel, then run a model through the Python LLM API or the trtllm-serve command. A recent NVIDIA GPU with a matching CUDA Toolkit is required.

Install system prerequisites

On Linux, install OpenMPI before the pip packages.

bashbash
sudo apt-get -y install libopenmpi-dev

Install PyTorch and TensorRT-LLM

Install a matching PyTorch build, then the tensorrt_llm wheel. The wheel requires a compatible CUDA Toolkit with CUDA_HOME configured.

bashbash
pip3 install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip3 install --ignore-installed pip setuptools wheel && pip3 install tensorrt_llm

Generate text with the Python API

Load a model with the LLM class and call generate with your sampling parameters.

pythonpython
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

Serve a model over HTTP

Use trtllm-serve to put a model behind an endpoint.

bashbash
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Serving an open-weight LLM behind an HTTP endpoint on NVIDIA data-center GPUs
  • Lowering latency and raising throughput for a model already running on a generic runtime
  • Running batch text generation from Python with custom sampling settings
  • Serving diffusion-based visual generation models alongside text LLMs on the same stack

How TensorRT-LLM compares

TensorRT-LLM alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

ToolStarsWhat it does
Transformers★ 162kHugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training.
vLLM★ 83.4kA high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang★ 29.3kA serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM★ 13.9kCompile LLMs into optimized engines for fast inference on NVIDIA GPUs
OpenLLM★ 12.4kA tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server★ 10.8kA multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO★ 10.4kAn open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache★ 9.4kA KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation.