PowerInfer

Run large language models fast on a single consumer GPU

Overview

PowerInfer is a CPU/GPU inference engine for running large language models on a personal computer with a single consumer-grade GPU. It builds on the observation that LLM neuron activations follow a power-law distribution: a small set of 'hot' neurons fire on almost every input, while the majority, the 'cold' neurons, activate only for specific inputs.

To exploit this skew, PowerInfer keeps the frequently activated hot neurons on the GPU for fast access and computes the rarely used cold neurons on the CPU. This reduces GPU memory demand and CPU-GPU data transfer, so a 24 GB card like an RTX 4090 can serve models that would otherwise need server-grade hardware.
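The hot/cold split can be pictured with a toy example. The sketch below ranks neurons by how often they fire and gives a fixed number of GPU slots to the most active ones. The activation counts, the slot budget, and the function name split_hot_cold are all invented for illustration; this is not PowerInfer's actual placement policy.

```shell
#!/usr/bin/env bash
# Toy illustration only: the counts and the slot budget below are made up.

# Reads "neuron_id activation_count" lines on stdin and prints
# "neuron_id device", handing GPU slots to the most active neurons first.
split_hot_cold() {
  local gpu_slots=$1
  sort -k2,2nr | awk -v slots="$gpu_slots" '{ print $1, (NR <= slots ? "gpu" : "cpu") }'
}

# Hypothetical profile: two neurons fire on nearly every input, the rest rarely.
profile='n0 980
n1 12
n2 955
n3 3
n4 40'

placement=$(printf '%s\n' "$profile" | split_hot_cold 2)
printf '%s\n' "$placement"
```

With these made-up counts and two GPU slots, n0 and n2 are placed on the GPU and the remaining three neurons stay on the CPU.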

It fits the local-runtime category alongside tools like llama.cpp. PowerInfer is a separate engine but reuses a similar workflow, works with ReLU-sparse GGUF models, and can also load standard llama.cpp weights for compatibility (without the speedup). It is aimed at developers who want low-latency local inference and serving rather than cloud APIs.

What it does

Locality-centric design: exploits sparse activation and the hot/cold neuron split for faster inference at lower resource cost
Hybrid CPU/GPU execution: hot neurons run on the GPU, cold neurons on the CPU, balancing the workload across both
Runs on consumer hardware: deeply optimized for low-latency inference on a single GPU
Works with ReLU-sparse models including Falcon-40B, the Llama 2 family, ProSparse Llama 2, and Bamboo-7B
Backward compatible: reuses most of llama.cpp's examples/ workflow, such as server and batched generation
Cross-platform: x86-64 with AVX2 on Linux and Windows (with or without NVIDIA GPU), plus Apple M chips on macOS (CPU only)

Getting started

Clone the repo, install Python requirements, build with CMake for your hardware, then run inference with a PowerInfer GGUF model.

Clone and install requirements

Get the source and install the Python dependencies used by the conversion and helper scripts.


git clone https://github.com/Tiiny-AI/PowerInfer
cd PowerInfer
pip install -r requirements.txt

Build with CMake

Pass -DLLAMA_CUBLAS=ON to build for an NVIDIA GPU. Omit the flag for a CPU-only build.


cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
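The choice between the GPU and CPU-only builds can be sketched as a small helper. The function name cmake_configure_args, its yes/no argument, and the use of nvidia-smi as a detection signal are assumptions made for this example; only the -DLLAMA_CUBLAS=ON flag and the two cmake commands come from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print CMake configure arguments for a GPU or CPU build.
cmake_configure_args() {
  if [ "$1" = "yes" ]; then
    echo "-S . -B build -DLLAMA_CUBLAS=ON"   # NVIDIA GPU build
  else
    echo "-S . -B build"                     # CPU-only build: flag dropped
  fi
}

# Assumption: nvidia-smi on PATH is used as a rough signal for an NVIDIA GPU.
if command -v nvidia-smi >/dev/null 2>&1; then use_gpu=yes; else use_gpu=no; fi

# Print the commands rather than running them, so the choice can be reviewed.
echo "cmake $(cmake_configure_args "$use_gpu")"
echo "cmake --build build --config Release"
```

Printing the commands first keeps the sketch free of side effects. Once the output looks right, the two lines can be pasted into the shell from the PowerInfer directory.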

Run inference

Point main at a PowerInfer GGUF model. GPU memory is allocated automatically; set -n for output tokens, -t for threads, and -p for the prompt.


./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
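Because standard llama.cpp weights also load but run without the speedup, a quick check on the model path before launching can help. The sketch below treats the ".powerinfer.gguf" suffix seen in the example path as a marker for PowerInfer-format weights. That naming convention is an assumption drawn from the example, and the function name is_powerinfer_gguf is made up.

```shell
#!/usr/bin/env bash
# Heuristic only: assumes PowerInfer-format files end in ".powerinfer.gguf",
# as the example path above does. The file contents are not inspected.
is_powerinfer_gguf() {
  case "$1" in
    *.powerinfer.gguf) return 0 ;;
    *) return 1 ;;
  esac
}

model=./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf
if is_powerinfer_gguf "$model"; then
  echo "PowerInfer GGUF: hybrid hot/cold execution expected"
else
  echo "plain GGUF: loads for compatibility, without the speedup"
fi
```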

Optional: cap VRAM use

Use --vram-budget to limit how much GPU memory PowerInfer may use when offloading hot neurons.


./build/bin/main -m /PATH/TO/MODEL -n "$output_token_count" -t "$thread_num" -p "$prompt" --vram-budget "$vram_gb"
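One way to pick a value for --vram-budget is to start from the free GPU memory and leave some headroom. The sketch below assumes the budget is given in whole GB, as the vram_gb placeholder suggests. The helper name vram_budget_gb and the 2 GiB default headroom are arbitrary choices for illustration, not recommendations from the project.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn free GPU memory (MiB) into a whole-GB budget,
# leaving headroom for other processes. Never returns less than 1.
vram_budget_gb() {
  local free_mib=$1
  local headroom_gib=${2:-2}
  local budget=$(( free_mib / 1024 - headroom_gib ))
  if [ "$budget" -lt 1 ]; then
    budget=1
  fi
  echo "$budget"
}

# On a machine with an NVIDIA driver, free memory in MiB can be read with:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
free_mib=24000   # example value, not a measurement
echo "--vram-budget $(vram_budget_gb "$free_mib")"
```

The printed flag can then be appended to the main command shown above.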

Commands and code are distilled from the project's own documentation. Always check the official repo for the latest.

When to use it

Running a 40B-class model locally on a single 24 GB consumer GPU instead of renting server-grade hardware
Low-latency local chat or generation on a developer workstation without sending data to a cloud API
Serving or batch-generating with ReLU-sparse models using the llama.cpp-style examples workflow
Experimenting with the hot/cold neuron offloading approach to fit larger models within a fixed VRAM budget

How PowerInfer compares

PowerInfer alongside the other open-source local-runtime tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	A C/C++ inference engine that runs LLMs in the GGUF format on CPUs, Apple Silicon, and GPUs with low memory use.
GPT4All	★ 77.4k	A free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
PowerInfer	★ 9.6k	Run large language models fast on a single consumer GPU
