AI/TLDR

ExLlamaV2

Run quantized EXL2 LLMs locally on consumer NVIDIA GPUs

Overview

ExLlamaV2 is an inference library for running local large language models on modern consumer NVIDIA GPUs. It focuses on the EXL2 quantization format (and also supports GPTQ), letting you fit larger models into limited VRAM and generate tokens quickly on a single card.

It is aimed at developers and self-hosters who want to run models on their own hardware instead of a hosted API. You can call it directly as a Python library, or pair it with a server like TabbyAPI to expose an OpenAI-compatible HTTP endpoint for frontends such as SillyTavern, ExUI, or text-generation-webui.

As a local runtime, it sits between your quantized model files and your application. Newer versions add a dynamic generator with batching, smart prompt caching, K/V cache deduplication, and paged attention via Flash Attention. Note that the project is now archived, with development continuing on ExLlamaV3.

What it does

  • Runs quantized LLMs in the EXL2 format, plus GPTQ models, on consumer NVIDIA GPUs
  • Dynamic generator with batching, smart prompt caching, and K/V cache deduplication
  • Paged attention via Flash Attention 2.5.7+
  • Single, batched, and asyncio-streamed generation through one simplified API
  • Q4 K/V cache mode to reduce memory use during inference
  • Multi-GPU inference with automatic GPU splitting (--gpu_split auto)

Getting started

Install from source with the CUDA Toolkit and a matching PyTorch build, then run a quick inference test against a local model. Prebuilt wheels are also available if you prefer not to compile the extension.

Install from source

Clone the repo and install. You need the CUDA Toolkit, a compiler (gcc on Linux or Visual Studio Build Tools on Windows), and a matching PyTorch version. This compiles the exllamav2_ext extension.

bashbash
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
pip install .

Run a test inference

Point the test script at a local EXL2 or GPTQ model. Add --gpu_split auto for multi-GPU setups.

bashbash
python test_inference.py -m <path_to_model> -p "Once upon a time,"

Try the console chatbot

An example chat script is included. The -mode flag picks the prompt format; run with -modes to list all options.

bashbash
python examples/chat.py -m <path_to_model> -mode llama -gs auto

Generate from Python

Once a model and generator are set up, the dynamic generator exposes a simple generate call for single or batched prompts.

pythonpython
output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Self-hosting a quantized LLM on a single consumer GPU instead of using a hosted API
  • Serving an OpenAI-compatible local endpoint via TabbyAPI for chat frontends like SillyTavern or ExUI
  • Fitting larger models into limited VRAM by running them in the EXL2 format
  • Batched or streamed text generation in a Python application or backend service

How ExLlamaV2 compares

ExLlamaV2 alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.

ToolStarsWhat it does
Ollama★ 175kA developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp★ 117kA C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All★ 77.4kGPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI★ 47kA self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan★ 43.1kAn open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile★ 25kA Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM★ 22.8kA machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
ExLlamaV2★ 4.6kRun quantized EXL2 LLMs locally on consumer NVIDIA GPUs