ExLlamaV2

Run quantized EXL2 LLMs locally on consumer NVIDIA GPUs

github.com/turboderp-org/exllamav2★ 4.6k

Overview

ExLlamaV2 is an inference library for running local large language models on modern consumer NVIDIA GPUs. It focuses on the EXL2 quantization format (and also supports GPTQ), letting you fit larger models into limited VRAM and generate tokens quickly on a single card.

It is aimed at developers and self-hosters who want to run models on their own hardware instead of a hosted API. You can call it directly as a Python library, or pair it with a server like TabbyAPI to expose an OpenAI-compatible HTTP endpoint for frontends such as SillyTavern, ExUI, or text-generation-webui.

As a local runtime, it sits between your quantized model files and your application. Newer versions add a dynamic generator with batching, smart prompt caching, K/V cache deduplication, and paged attention via Flash Attention. Note that the project is now archived, with development continuing on ExLlamaV3.

What it does

Runs quantized LLMs in the EXL2 format, plus GPTQ models, on consumer NVIDIA GPUs
Dynamic generator with batching, smart prompt caching, and K/V cache deduplication
Paged attention via Flash Attention 2.5.7+
Single, batched, and asyncio-streamed generation through one simplified API
Q4 K/V cache mode to reduce memory use during inference
Multi-GPU inference with automatic GPU splitting (--gpu_split auto)

Getting started

Install from source with the CUDA Toolkit and a matching PyTorch build, then run a quick inference test against a local model. Prebuilt wheels are also available if you prefer not to compile the extension.

Install from source

Clone the repo and install. You need the CUDA Toolkit, a compiler (gcc on Linux or Visual Studio Build Tools on Windows), and a matching PyTorch version. This compiles the exllamav2_ext extension.

bashbash

git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install -r requirements.txt
pip install .

Run a test inference

Point the test script at a local EXL2 or GPTQ model. Add --gpu_split auto for multi-GPU setups.

bashbash

python test_inference.py -m <path_to_model> -p "Once upon a time,"

Try the console chatbot

An example chat script is included. The -mode flag picks the prompt format; run with -modes to list all options.

bashbash

python examples/chat.py -m <path_to_model> -mode llama -gs auto

Generate from Python

Once a model and generator are set up, the dynamic generator exposes a simple generate call for single or batched prompts.

pythonpython

output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Self-hosting a quantized LLM on a single consumer GPU instead of using a hosted API
Serving an OpenAI-compatible local endpoint via TabbyAPI for chat frontends like SillyTavern or ExUI
Fitting larger models into limited VRAM by running them in the EXL2 format
Batched or streamed text generation in a Python application or backend service

How ExLlamaV2 compares

ExLlamaV2 alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
ExLlamaV2	★ 4.6k	Run quantized EXL2 LLMs locally on consumer NVIDIA GPUs

// Overview

// What it does

// Getting started

Install from source

Run a test inference

Try the console chatbot

Generate from Python

// When to use it

// How ExLlamaV2 compares

Overview

What it does

Getting started

When to use it

How ExLlamaV2 compares