Overview
PowerInfer is a CPU/GPU inference engine for running large language models on a personal computer with a single consumer-grade GPU. It builds on the observation that LLM neuron activations follow a power-law distribution: a small set of 'hot' neurons fire on almost every input, while the majority, the 'cold' neurons, activate only for specific inputs.
To exploit this skew, PowerInfer keeps the frequently activated hot neurons on the GPU for fast access and computes the rarely used cold neurons on the CPU. This cuts GPU memory pressure and CPU-GPU data transfer, so a 24 GB card like an RTX 4090 can serve models that would otherwise need server-grade hardware.
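The hot/cold split described above can be pictured with a toy example. This is a minimal sketch, not PowerInfer's actual placement policy (which this page does not detail): it assumes a hypothetical input of "neuron_id activation_count" lines, treats every neuron as costing the same amount of GPU memory, and simply keeps the most frequently fired neurons on the GPU. The function name `split_hot_cold` and the sample counts are made up for illustration.

```shell
# Toy illustration only: rank neurons by how often they fired and keep the
# top `budget` of them on the GPU. Input lines are "neuron_id activation_count".
# Assumes every neuron costs the same amount of GPU memory (a simplification).
split_hot_cold() {
  pi_budget=$1
  # Sort by count, highest first; ties are broken by neuron id.
  LC_ALL=C sort -k2,2nr -k1,1 |
    awk -v budget="$pi_budget" '{ print $1, (NR <= budget ? "gpu" : "cpu") }'
}

# Made-up activation counts out of 1000 sampled inputs.
printf '%s\n' 'n0 3' 'n1 980' 'n2 12' 'n3 875' | split_hot_cold 2
# prints:
# n1 gpu
# n3 gpu
# n2 cpu
# n0 cpu
```

With a skewed distribution like this one, a small GPU budget already covers the neurons that account for most activations, which is the intuition behind the design.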
It fits the local-runtime category alongside tools like llama.cpp. PowerInfer is a separate engine but reuses a similar workflow, works with ReLU-sparse GGUF models, and can also load standard llama.cpp weights for compatibility (without the speedup). It is aimed at developers who want low-latency local inference and serving rather than cloud APIs.
What it does
- Locality-centric design: exploits sparse activation and the hot/cold neuron split for faster inference at lower resource cost
- Hybrid CPU/GPU execution: hot neurons run on the GPU, cold neurons on the CPU, balancing the workload across both
- Runs on consumer hardware: deeply optimized for low-latency inference on a single GPU
- Works with ReLU-sparse models including Falcon-40B, the Llama 2 family, ProSparse Llama 2, and Bamboo-7B
- Backward compatible: reuses most of the llama.cpp examples/ workflow, such as server and batched generation
- Cross-platform: x86-64 with AVX2 on Linux and Windows (with or without an NVIDIA GPU), plus Apple M-series chips on macOS (CPU only)
Getting started
Clone the repo, install Python requirements, build with CMake for your hardware, then run inference with a PowerInfer GGUF model.
Clone and install requirements
Get the source and install the Python dependencies used by the conversion and helper scripts.
git clone https://github.com/Tiiny-AI/PowerInfer
cd PowerInfer
pip install -r requirements.txt
Build with CMake
Pass -DLLAMA_CUBLAS=ON for an NVIDIA GPU. Drop the flag for a CPU-only build.
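If the same build script has to work on machines with and without an NVIDIA GPU, the flag choice could be scripted. The following is a minimal sketch: the helper name `powerinfer_cmake_flags` is hypothetical, and the "auto" mode assumes that finding `nvidia-smi` on the PATH is a reasonable proxy for a usable NVIDIA setup, which is only a heuristic (a GPU build still needs the CUDA toolkit installed).

```shell
# Hypothetical helper: prints the extra CMake flag for the chosen build mode.
# Modes: gpu, cpu, or auto (guess from whether nvidia-smi is on PATH).
powerinfer_cmake_flags() {
  pi_mode=${1:-auto}
  if [ "$pi_mode" = auto ]; then
    if command -v nvidia-smi >/dev/null 2>&1; then
      pi_mode=gpu
    else
      pi_mode=cpu
    fi
  fi
  case "$pi_mode" in
    gpu) echo "-DLLAMA_CUBLAS=ON" ;;
    cpu) echo "" ;;   # CPU-only build: no extra flag
    *)   echo "unknown mode: $pi_mode" >&2; return 2 ;;
  esac
}

# Usage (left commented out so the sketch has no side effects). The unquoted
# substitution is intentional: an empty result adds no argument.
# cmake -S . -B build $(powerinfer_cmake_flags auto)
# cmake --build build --config Release
```

The explicit commands for a GPU build are: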
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
Run inference
Point main at a PowerInfer GGUF model. GPU memory is allocated automatically; set -n for output tokens, -t for threads, and -p for the prompt.
./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
Optional: cap VRAM use
Use --vram-budget to limit how much GPU memory PowerInfer may use when offloading hot neurons.
./build/bin/main -m /PATH/TO/MODEL -n $output_token_count -t $thread_num -p $prompt --vram-budget $vram_gb
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
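The run commands in this section differ only in whether --vram-budget is passed, so they could be wrapped in one small function. This is a sketch rather than anything shipped with PowerInfer: the function name `run_powerinfer` and the `POWERINFER_MAIN` override variable are hypothetical, and only the flags (-m, -n, -t, -p, --vram-budget) come from the commands above. Quoting each argument keeps a multi-word prompt intact as a single argument.

```shell
# Hypothetical wrapper around the main binary.
# usage: run_powerinfer MODEL N_TOKENS THREADS PROMPT [VRAM_GB]
run_powerinfer() {
  if [ "$#" -lt 4 ]; then
    echo "usage: run_powerinfer MODEL N_TOKENS THREADS PROMPT [VRAM_GB]" >&2
    return 2
  fi
  pi_model=$1 pi_tokens=$2 pi_threads=$3 pi_prompt=$4 pi_vram=${5:-}

  # Rebuild the positional parameters as the argument list for main.
  set -- -m "$pi_model" -n "$pi_tokens" -t "$pi_threads" -p "$pi_prompt"

  # Only add the cap when a budget was given; otherwise GPU memory is
  # allocated automatically, as described above.
  if [ -n "$pi_vram" ]; then
    set -- "$@" --vram-budget "$pi_vram"
  fi

  # POWERINFER_MAIN is an assumed override; the default matches the build path.
  "${POWERINFER_MAIN:-./build/bin/main}" "$@"
}
```

For example, `run_powerinfer /PATH/TO/MODEL 128 8 "Once upon a time"` would run without a cap, and adding a fifth argument would pass it through as the --vram-budget value.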
When to use it
- Running a 40B-class model locally on a single 24 GB consumer GPU instead of renting server-grade hardware
- Low-latency local chat or generation on a developer workstation without sending data to a cloud API
- Serving or batch-generating with ReLU-sparse models using the llama.cpp-style examples workflow
- Experimenting with the hot/cold neuron offloading approach to fit larger models within a fixed VRAM budget
How PowerInfer compares
PowerInfer alongside the other open-source local runtime tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use. |
| GPT4All | ★ 77.4k | GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware. |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS. |
| MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation. |
| PowerInfer | ★ 9.6k | An inference engine that runs large language models fast on a single consumer GPU. |