Overview
ExLlamaV2 is an inference library for running local large language models on modern consumer NVIDIA GPUs. It focuses on the EXL2 quantization format (and also supports GPTQ), letting you fit larger models into limited VRAM and generate tokens quickly on a single card.
It is aimed at developers and self-hosters who want to run models on their own hardware instead of a hosted API. You can call it directly as a Python library, or pair it with a server like TabbyAPI to expose an OpenAI-compatible HTTP endpoint for frontends such as SillyTavern, ExUI, or text-generation-webui.
As a local runtime, it sits between your quantized model files and your application. Newer versions add a dynamic generator with batching, smart prompt caching, K/V cache deduplication, and paged attention via Flash Attention. Note that the project is now archived, with development continuing on ExLlamaV3.
What it does
- Runs quantized LLMs in the EXL2 format, plus GPTQ models, on consumer NVIDIA GPUs
- Dynamic generator with batching, smart prompt caching, and K/V cache deduplication
- Paged attention via Flash Attention 2.5.7+
- Single, batched, and asyncio-streamed generation through one simplified API
- Q4 K/V cache mode to reduce memory use during inference
- Multi-GPU inference with automatic GPU splitting (--gpu_split auto)
Getting started
Install from source with the CUDA Toolkit and a matching PyTorch build, then run a quick inference test against a local model. Prebuilt wheels are also available if you prefer not to compile the extension.
Install from source
Clone the repo and install. You need the CUDA Toolkit, a compiler (gcc on Linux or Visual Studio Build Tools on Windows), and a matching PyTorch version. This compiles the exllamav2_ext extension.
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
pip install .Run a test inference
Point the test script at a local EXL2 or GPTQ model. Add --gpu_split auto for multi-GPU setups.
python test_inference.py -m <path_to_model> -p "Once upon a time,"Try the console chatbot
An example chat script is included. The -mode flag picks the prompt format; run with -modes to list all options.
python examples/chat.py -m <path_to_model> -mode llama -gs autoGenerate from Python
Once a model and generator are set up, the dynamic generator exposes a simple generate call for single or batched prompts.
output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Self-hosting a quantized LLM on a single consumer GPU instead of using a hosted API
- Serving an OpenAI-compatible local endpoint via TabbyAPI for chat frontends like SillyTavern or ExUI
- Fitting larger models into limited VRAM by running them in the EXL2 format
- Batched or streamed text generation in a Python application or backend service
How ExLlamaV2 compares
ExLlamaV2 alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use. |
| GPT4All | ★ 77.4k | GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware. |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS. |
| MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation. |
| ExLlamaV2 | ★ 4.6k | Run quantized EXL2 LLMs locally on consumer NVIDIA GPUs |