Overview
llama.cpp is a plain C/C++ inference engine for running large language models locally and in the cloud. It loads models in the GGUF format and runs them on a wide range of hardware with minimal setup, from a laptop CPU to NVIDIA, AMD, and Apple GPUs.
It is built for developers who want to run open models on their own machines without a heavy Python stack or external dependencies. Integer quantization (from 1.5-bit up to 8-bit) lowers memory use, and CPU+GPU hybrid inference lets you partially accelerate models that are larger than your total VRAM.
As a local runtime in the inference and serving space, llama.cpp gives you both a command-line tool (llama-cli) for one-off prompts and an OpenAI-compatible server (llama-server) you can point existing client code at. It is also the main playground for the underlying ggml library.
What it does
- Plain C/C++ implementation with no external dependencies
- Runs GGUF models on CPU, Apple Silicon (Metal/NEON/Accelerate), and GPUs via CUDA, HIP, MUSA, Vulkan, and SYCL
- Integer quantization from 1.5-bit to 8-bit for faster inference and reduced memory use
- CPU+GPU hybrid inference to partially accelerate models larger than total VRAM
- Built-in OpenAI-compatible REST API server (llama-server), with multimodal support
- Download and run models directly from Hugging Face with the -hf flag
Getting started
Install a pre-built binary or build from source, then point llama.cpp at a GGUF model file or a Hugging Face repo.
Install llama.cpp
Install with a package manager (brew, nix, or winget), run it with Docker, download a pre-built binary from the releases page, or build from source. See the project's install and build guides for details.
brew install llama.cppRun a model from the command line
Use llama-cli with a local GGUF file, or pass -hf to download and run a model straight from Hugging Face.
# Use a local model file
llama-cli -m my_model.gguf
# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUFLaunch the OpenAI-compatible server
Start llama-server to expose a REST API that OpenAI-compatible clients can call.
llama-server -hf ggml-org/gemma-3-1b-it-GGUFCommands and code are distilled from the project's own documentation — always check the official repo for the latest.
When to use it
- Run open models offline on a laptop or workstation without a Python stack
- Serve a local OpenAI-compatible API for apps and agents during development
- Fit larger models on limited hardware using quantization and CPU+GPU hybrid inference
- Run inference on Apple Silicon or non-NVIDIA GPUs via Metal, Vulkan, HIP, or SYCL
How llama.cpp compares
llama.cpp alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Ollama | ★ 175k | A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API. |
| llama.cpp | ★ 117k | Run LLMs in C/C++ on CPU, Apple Silicon, and GPU with low memory use |
| GPT4All | ★ 77.4k | GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required. |
| LocalAI | ★ 47k | A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware. |
| Jan | ★ 43.1k | An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer. |
| llamafile | ★ 25k | A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS. |
| MLC LLM | ★ 22.8k | A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation. |
| KTransformers | ★ 17.3k | A framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM. |