llama.cpp

Run LLMs in C/C++ on CPU, Apple Silicon, and GPU with low memory use

Overview

llama.cpp is a plain C/C++ inference engine for running large language models locally and in the cloud. It loads models in the GGUF format and runs them on a wide range of hardware with minimal setup, from a laptop CPU to NVIDIA, AMD, and Apple GPUs.

It is built for developers who want to run open models on their own machines without a heavy Python stack or external dependencies. Integer quantization (from 1.5-bit up to 8-bit) lowers memory use, and CPU+GPU hybrid inference lets you partially accelerate models that are larger than your total VRAM.

As a local runtime in the inference and serving space, llama.cpp gives you both a command-line tool (llama-cli) for one-off prompts and an OpenAI-compatible server (llama-server) you can point existing client code at. It is also the main playground for the underlying ggml library.

What it does

Plain C/C++ implementation with no external dependencies
Runs GGUF models on CPU, Apple Silicon (Metal/NEON/Accelerate), and GPUs via CUDA, HIP, MUSA, Vulkan, and SYCL
Integer quantization from 1.5-bit to 8-bit for faster inference and reduced memory use
CPU+GPU hybrid inference to partially accelerate models larger than total VRAM
Built-in OpenAI-compatible REST API server (llama-server), with multimodal support
Download and run models directly from Hugging Face with the -hf flag

Getting started

Install a pre-built binary or build from source, then point llama.cpp at a GGUF model file or a Hugging Face repo.

Install llama.cpp

Install with a package manager (brew, nix, or winget), run it with Docker, download a pre-built binary from the releases page, or build from source. See the project's install and build guides for details.

bashbash

brew install llama.cpp

Run a model from the command line

Use llama-cli with a local GGUF file, or pass -hf to download and run a model straight from Hugging Face.

bashbash

# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

Launch the OpenAI-compatible server

Start llama-server to expose a REST API that OpenAI-compatible clients can call.

bashbash

llama-server -hf ggml-org/gemma-3-1b-it-GGUF

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Run open models offline on a laptop or workstation without a Python stack
Serve a local OpenAI-compatible API for apps and agents during development
Fit larger models on limited hardware using quantization and CPU+GPU hybrid inference
Run inference on Apple Silicon or non-NVIDIA GPUs via Metal, Vulkan, HIP, or SYCL

How llama.cpp compares

llama.cpp alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Ollama	★ 175k	A developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp	★ 117k	Run LLMs in C/C++ on CPU, Apple Silicon, and GPU with low memory use
GPT4All	★ 77.4k	GPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI	★ 47k	A self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan	★ 43.1k	An open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile	★ 25k	A Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM	★ 22.8k	A machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
KTransformers	★ 17.3k	A framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM.

// Overview

// What it does

// Getting started

Install llama.cpp

Run a model from the command line

Launch the OpenAI-compatible server

// When to use it

// How llama.cpp compares

Overview

What it does

Getting started

When to use it

How llama.cpp compares