ExLlamaV3

Run quantized local LLMs on consumer GPUs with the EXL3 format

Overview

ExLlamaV3 is an inference library for running local large language models on modern consumer GPUs. It is the next version of the ExLlama project and introduces the EXL3 quantization format, based on QTIP, which lets you fit larger models into limited VRAM while keeping inference fast.

It is built for people who want to run models on their own hardware rather than calling a hosted API. You can quantize a Hugging Face model to EXL3 and serve it through TabbyAPI, the recommended backend that exposes an OpenAI-compatible endpoint. It also plugs into HF Transformers as a backend.

As a local runtime, it sits between your downloaded model weights and the application that talks to them. It supports tensor-parallel and expert-parallel inference, continuous dynamic batching, speculative decoding, cache quantization, multimodal models, and LoRA adapters across a wide range of model architectures.

What it does

EXL3 quantization format based on QTIP for fitting larger models into limited GPU memory
Tensor-parallel and expert-parallel inference across consumer multi-GPU setups
Continuous, dynamic batching for serving multiple requests at once
OpenAI-compatible server through the TabbyAPI backend, plus a HF Transformers plugin
Speculative decoding, 2-8 bit cache quantization, multimodal support, and LoRA adapters
Broad model architecture support including Llama, Mistral, Qwen, Gemma, GLM, and Command-R

Getting started

Make sure you have a CUDA 12.4 or later build of PyTorch installed first, since the Torch dependency is not handled automatically by pip. Then install ExLlamaV3 and convert a model to EXL3.

Install from a prebuilt wheel (recommended)

Pick a wheel matching your CUDA, Torch, and Python version from the releases page, then install it directly with pip. This is the simplest path if you are unsure which method to use.

bashbash

pip install https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.8.0-cp313-cp313-linux_x86_64.whl

Or install from PyPI

The PyPI package has no prebuilt extension, so it needs the CUDA toolkit and build prerequisites (VS Build Tools on Windows, gcc on Linux, python-dev headers).

bashbash

pip install exllamav3

Convert a model to EXL3

Use the conversion script to quantize a Hugging Face model. The working directory holds checkpoints and quantized tensors and needs room for a full copy of the output model.

bashbash

python convert.py -i <input_dir> -o <output_dir> -w <working_dir> -b <bitrate>

Serve with TabbyAPI

TabbyAPI is the recommended backend and provides an OpenAI-compatible API, with a startup script that installs prerequisites and manages inference. Point it at your converted EXL3 model to start serving.

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Run a large open-weight model on a single consumer GPU by quantizing it to EXL3 to save VRAM
Stand up a private, OpenAI-compatible LLM endpoint on your own hardware via TabbyAPI
Spread inference across multiple consumer GPUs using tensor-parallel or expert-parallel modes
Use it as a faster local backend behind HF Transformers for experimentation

How ExLlamaV3 compares

ExLlamaV3 alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
ExLlamaV3	★ 960	Run quantized local LLMs on consumer GPUs with the EXL3 format

// Overview

// What it does

// Getting started

Install from a prebuilt wheel (recommended)

Or install from PyPI

Convert a model to EXL3

Serve with TabbyAPI

// When to use it

// How ExLlamaV3 compares

Overview

What it does

Getting started

When to use it

How ExLlamaV3 compares