Overview
TensorRT-LLM is NVIDIA's open-source library for running large language models and visual generation models on NVIDIA GPUs. It optimizes inference with specialized kernels for common operations, an efficient runtime, and a Python framework you can use to customize and extend the system.
It is built for teams that serve models on NVIDIA data-center GPUs and want lower latency and higher throughput than a generic runtime gives. You write against a Python LLM API, and TensorRT-LLM handles the work of turning the model into an optimized engine for the target hardware.
Within the inference and serving space, it sits in the high-throughput serving layer. You can call it directly from Python for batch generation, or start an OpenAI-compatible HTTP server with the bundled trtllm-serve command to put a model behind an endpoint.
What it does
- Specialized GPU kernels for common LLM operations, plus an efficient inference runtime
- Python LLM API for generation with configurable sampling parameters
- trtllm-serve command that exposes a model over an HTTP endpoint
- Support for recent open-weight models, including day-0 support for several new releases
- Support for visual generation (diffusion) models in addition to text LLMs
- Apache-2.0 licensed and developed in the open on GitHub
Getting started
Install the prerequisites and the tensorrt_llm wheel, then run a model through the Python LLM API or the trtllm-serve command. A recent NVIDIA GPU with a matching CUDA Toolkit is required.
Install system prerequisites
On Linux, install OpenMPI before the pip packages.
sudo apt-get -y install libopenmpi-devInstall PyTorch and TensorRT-LLM
Install a matching PyTorch build, then the tensorrt_llm wheel. The wheel requires a compatible CUDA Toolkit with CUDA_HOME configured.
pip3 install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip3 install --ignore-installed pip setuptools wheel && pip3 install tensorrt_llmGenerate text with the Python API
Load a model with the LLM class and call generate with your sampling parameters.
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(prompts, sampling_params):
print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")Serve a model over HTTP
Use trtllm-serve to put a model behind an endpoint.
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0"Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Serving an open-weight LLM behind an HTTP endpoint on NVIDIA data-center GPUs
- Lowering latency and raising throughput for a model already running on a generic runtime
- Running batch text generation from Python with custom sampling settings
- Serving diffusion-based visual generation models alongside text LLMs on the same stack
How TensorRT-LLM compares
TensorRT-LLM alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Transformers | ★ 162k | Hugging Face Transformers is a Python framework that defines and runs state-of-the-art pretrained models for text, vision, audio, and multimodal tasks, for both inference and training. |
| vLLM | ★ 83.4k | A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once. |
| SGLang | ★ 29.3k | A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests. |
| TensorRT-LLM | ★ 13.9k | Compile LLMs into optimized engines for fast inference on NVIDIA GPUs |
| OpenLLM | ★ 12.4k | A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud. |
| NVIDIA Triton Inference Server | ★ 10.8k | A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution. |
| OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware. |
| LMCache | ★ 9.4k | A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation. |