Overview
ExLlamaV3 is an inference library for running local large language models on modern consumer GPUs. It is the next version of the ExLlama project and introduces the EXL3 quantization format, based on QTIP, which lets you fit larger models into limited VRAM while keeping inference fast.
It is built for people who want to run models on their own hardware rather than calling a hosted API. You can quantize a Hugging Face model to EXL3 and serve it through TabbyAPI, the recommended backend that exposes an OpenAI-compatible endpoint. It also plugs into HF Transformers as a backend.
As a local runtime, it sits between your downloaded model weights and the application that talks to them. It supports tensor-parallel and expert-parallel inference, continuous dynamic batching, speculative decoding, cache quantization, multimodal models, and LoRA adapters across a wide range of model architectures.
What it does
- EXL3 quantization format based on QTIP for fitting larger models into limited GPU memory
- Tensor-parallel and expert-parallel inference across consumer multi-GPU setups
- Continuous, dynamic batching for serving multiple requests at once
- OpenAI-compatible server through the TabbyAPI backend, plus a HF Transformers plugin
- Speculative decoding, 2-8 bit cache quantization, multimodal support, and LoRA adapters
- Broad model architecture support including Llama, Mistral, Qwen, Gemma, GLM, and Command-R
Getting started
Make sure you have a CUDA 12.4 or later build of PyTorch installed first, since the Torch dependency is not handled automatically by pip. Then install ExLlamaV3 and convert a model to EXL3.
Install from a prebuilt wheel (recommended)
Pick a wheel matching your CUDA, Torch, and Python version from the releases page, then install it directly with pip. This is the simplest path if you are unsure which method to use.
pip install https://github.com/turboderp-org/exllamav3/releases/download/v0.0.6/exllamav3-0.0.6+cu128.torch2.8.0-cp313-cp313-linux_x86_64.whlOr install from PyPI
The PyPI package has no prebuilt extension, so it needs the CUDA toolkit and build prerequisites (VS Build Tools on Windows, gcc on Linux, python-dev headers).
pip install exllamav3Convert a model to EXL3
Use the conversion script to quantize a Hugging Face model. The working directory holds checkpoints and quantized tensors and needs room for a full copy of the output model.
python convert.py -i <input_dir> -o <output_dir> -w <working_dir> -b <bitrate>Serve with TabbyAPI
TabbyAPI is the recommended backend and provides an OpenAI-compatible API, with a startup script that installs prerequisites and manages inference. Point it at your converted EXL3 model to start serving.
Commands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Run a large open-weight model on a single consumer GPU by quantizing it to EXL3 to save VRAM
- Stand up a private, OpenAI-compatible LLM endpoint on your own hardware via TabbyAPI
- Spread inference across multiple consumer GPUs using tensor-parallel or expert-parallel modes
- Use it as a faster local backend behind HF Transformers for experimentation
How ExLlamaV3 compares
ExLlamaV3 alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use. |
| GPT4All | ★ 77.4k | GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware. |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS. |
| MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation. |
| ExLlamaV3 | ★ 960 | Run quantized local LLMs on consumer GPUs with the EXL3 format |