Overview
Transformers is Hugging Face's open-source Python library for state-of-the-art pretrained machine learning models. It covers text, computer vision, audio, video, and multimodal tasks, and supports both running models (inference) and training them. There are over a million model checkpoints on the Hugging Face Hub that you can load and use with it.
The library acts as a shared model-definition framework for the wider AI ecosystem. Once a model is defined in Transformers, that same definition works across training frameworks like Axolotl, Unsloth, DeepSpeed, and FSDP, inference engines like vLLM, SGLang, and TGI, and adjacent libraries such as llama.cpp and MLX. The result is one agreed-upon definition instead of many incompatible ones.
Transformers aims for a low barrier to entry: a small number of user-facing classes, a unified API across every pretrained model, and a high-level Pipeline that handles preprocessing and output for you. Model files are kept deliberately simple and free of extra abstractions so researchers can iterate on a single model quickly.
What it does
- Unified API over 1M+ pretrained checkpoints on the Hugging Face Hub, with just a few user-facing classes to learn
- Covers text, computer vision, audio, video, and multimodal tasks in one library
- High-level Pipeline API that handles preprocessing and returns ready-to-use output for tasks like text generation, speech recognition, and image classification
- Shared model definition that stays compatible with training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP) and inference engines (vLLM, SGLang, TGI)
- Works for both inference and training, and lets you move a single model between PyTorch, JAX, and TF2.0
- Built-in command line tools, including transformers chat and transformers serve, for interacting with models directly
Getting started
Transformers works with Python 3.10+ and PyTorch 2.4+. Install it into a virtual environment, then use the Pipeline API to run a model in a few lines.
Install Transformers
Install the library with the PyTorch extra using pip (or uv) inside an activated virtual environment.
pip install "transformers[torch]"
Run a model with the Pipeline API
Create a pipeline for a task, point it at a model on the Hub, and pass it some input. The model is downloaded and cached on first use.
from transformers import pipeline
# Name the instance "generator" so it does not shadow the imported pipeline() factory
generator = pipeline(task="text-generation", model="Qwen/Qwen2.5-1.5B")
generator("the secret to baking a really good cake is ")
Chat from the command line
If you want a quick interactive chat, you can talk to an instruct model straight from the terminal.
transformers chat Qwen/Qwen2.5-0.5B-Instruct
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
When to use it
- Run inference on pretrained models for text generation, chat, speech recognition, image classification, or visual question answering without building the pipeline yourself
- Fine-tune or train state-of-the-art models and keep the definition compatible with engines like vLLM and frameworks like DeepSpeed
- Prototype across modalities (text, vision, audio, multimodal) with one consistent API instead of separate task-specific libraries
- Load any of the 1M+ checkpoints from the Hugging Face Hub as a starting point for a project
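For the last point, loading a checkpoint as a starting point, a minimal sketch using the lower-level `AutoTokenizer` and `AutoModelForCausalLM` classes might look like the following. The helper names `load_checkpoint` and `complete` and the default of 32 new tokens are assumptions made for illustration; the default checkpoint is the instruct model from the chat example above.

```python
def load_checkpoint(model_id: str):
    """Load a tokenizer and causal language model from the Hub (hypothetical helper).

    The import is deferred so nothing is downloaded until this is called.
    Files are cached locally after the first download.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model


def complete(
    prompt: str,
    model_id: str = "Qwen/Qwen2.5-0.5B-Instruct",
    max_new_tokens: int = 32,
) -> str:
    """Generate a continuation of `prompt` and return it as text."""
    tokenizer, model = load_checkpoint(model_id)
    # Tokenize to PyTorch tensors, generate token ids, then decode back to text.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the checkpoint on first run):
# print(complete("The secret to baking a really good cake is"))
```

Working with the model and tokenizer directly, rather than through a pipeline, is the usual starting point when you plan to fine-tune or otherwise modify the model.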
How Transformers compares
Transformers alongside the other open-source serving and deployment tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Transformers | ★ 162k | One model-definition framework for text, vision, audio, and multimodal models, covering both inference and training. |
| vLLM | ★ 83.4k | A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once. |
| SGLang | ★ 29.3k | A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests. |
| TensorRT-LLM | ★ 13.9k | NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs. |
| OpenLLM | ★ 12.4k | A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud. |
| NVIDIA Triton Inference Server | ★ 10.8k | A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution. |
| OpenVINO | ★ 10.4k | An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware. |
| LMCache | ★ 9.4k | A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation. |