Transformers

One model-definition framework for text, vision, audio, and multimodal — inference and training

github.com/huggingface/transformers★ 162k huggingface.co/transformers

Overview

Transformers is Hugging Face's open-source Python library for state-of-the-art pretrained machine learning models. It covers text, computer vision, audio, video, and multimodal tasks, and supports both running models (inference) and training them. There are over a million model checkpoints on the Hugging Face Hub that you can load and use with it.

The library acts as a shared model-definition framework for the wider AI ecosystem. Once a model is defined in Transformers, that same definition works across training frameworks like Axolotl, Unsloth, DeepSpeed, and FSDP, inference engines like vLLM, SGLang, and TGI, and adjacent libraries such as llama.cpp and mlx. This keeps one agreed-upon definition instead of many incompatible ones.

Transformers aims for a low barrier to entry: a small number of user-facing classes, a unified API across every pretrained model, and a high-level Pipeline that handles preprocessing and output for you. Model files are kept deliberately simple and free of extra abstractions so researchers can iterate on a single model quickly.

What it does

Unified API over 1M+ pretrained checkpoints on the Hugging Face Hub, with just a few user-facing classes to learn
Covers text, computer vision, audio, video, and multimodal tasks in one library
High-level Pipeline API that handles preprocessing and returns ready-to-use output for tasks like text generation, speech recognition, and image classification
Shared model definition that stays compatible with training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP) and inference engines (vLLM, SGLang, TGI)
Works for both inference and training, and lets you move a single model between PyTorch, JAX, and TF2.0
Built-in command line tools, including transformers chat and transformers serve, for interacting with models directly

Getting started

Transformers works with Python 3.10+ and PyTorch 2.4+. Install it into a virtual environment, then use the Pipeline API to run a model in a few lines.

Install Transformers

Install the library with the PyTorch extra using pip (or uv) inside an activated virtual environment.

bashbash

pip install "transformers[torch]"

Run a model with the Pipeline API

Create a pipeline for a task, point it at a model on the Hub, and pass it some input. The model is downloaded and cached on first use.

pythonpython

from transformers import pipeline

pipeline = pipeline(task="text-generation", model="Qwen/Qwen2.5-1.5B")
pipeline("the secret to baking a really good cake is ")

Chat from the command line

If you want a quick interactive chat, you can talk to an instruct model straight from the terminal.

bashbash

transformers chat Qwen/Qwen2.5-0.5B-Instruct

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Run inference on pretrained models for text generation, chat, speech recognition, image classification, or visual question answering without building the pipeline yourself
Fine-tune or train state-of-the-art models and keep the definition compatible with engines like vLLM and frameworks like DeepSpeed
Prototype across modalities (text, vision, audio, multimodal) with one consistent API instead of separate task-specific libraries
Load any of the 1M+ checkpoints from the Hugging Face Hub as a starting point for a project

How Transformers compares

Transformers alongside other open-source serving & deployment tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Transformers	★ 162k	One model-definition framework for text, vision, audio, and multimodal — inference and training
vLLM	★ 83.4k	A high-throughput LLM serving engine that uses PagedAttention and continuous batching to serve many requests at once.
SGLang	★ 29.3k	A serving framework for LLMs and multimodal models that boosts throughput by reusing shared prompt prefixes across requests.
TensorRT-LLM	★ 13.9k	NVIDIA's library that compiles LLMs into optimized engines for the fastest inference on its data-center GPUs.
OpenLLM	★ 12.4k	A tool to run any open-source LLM as an OpenAI-compatible API endpoint locally or in the cloud.
NVIDIA Triton Inference Server	★ 10.8k	A multi-framework model server that runs TensorRT, PyTorch, ONNX, and other models with dynamic batching and concurrent execution.
OpenVINO	★ 10.4k	An open-source toolkit from Intel that converts and optimizes deep learning models, then runs fast inference on CPU, GPU, and NPU hardware.
LMCache	★ 9.4k	A KV-cache layer that stores and shares cached attention state across engines and requests to cut repeated computation.

// Overview

// What it does

// Getting started