AI/TLDR

llama.cpp

Run LLMs in C/C++ on CPU, Apple Silicon, and GPU with low memory use

Overview

llama.cpp is a plain C/C++ inference engine for running large language models locally and in the cloud. It loads models in the GGUF format and runs them on a wide range of hardware with minimal setup, from a laptop CPU to NVIDIA, AMD, and Apple GPUs.

It is built for developers who want to run open models on their own machines without a heavy Python stack or external dependencies. Integer quantization (from 1.5-bit up to 8-bit) lowers memory use, and CPU+GPU hybrid inference lets you partially accelerate models that are larger than your total VRAM.

As a local runtime in the inference and serving space, llama.cpp gives you both a command-line tool (llama-cli) for one-off prompts and an OpenAI-compatible server (llama-server) you can point existing client code at. It is also the main playground for the underlying ggml library.

What it does

  • Plain C/C++ implementation with no external dependencies
  • Runs GGUF models on CPU, Apple Silicon (Metal/NEON/Accelerate), and GPUs via CUDA, HIP, MUSA, Vulkan, and SYCL
  • Integer quantization from 1.5-bit to 8-bit for faster inference and reduced memory use
  • CPU+GPU hybrid inference to partially accelerate models larger than total VRAM
  • Built-in OpenAI-compatible REST API server (llama-server), with multimodal support
  • Download and run models directly from Hugging Face with the -hf flag

Getting started

Install a pre-built binary or build from source, then point llama.cpp at a GGUF model file or a Hugging Face repo.

Install llama.cpp

Install with a package manager (brew, nix, or winget), run it with Docker, download a pre-built binary from the releases page, or build from source. See the project's install and build guides for details.

bashbash
brew install llama.cpp

Run a model from the command line

Use llama-cli with a local GGUF file, or pass -hf to download and run a model straight from Hugging Face.

bashbash
# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

Launch the OpenAI-compatible server

Start llama-server to expose a REST API that OpenAI-compatible clients can call.

bashbash
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

  • Run open models offline on a laptop or workstation without a Python stack
  • Serve a local OpenAI-compatible API for apps and agents during development
  • Fit larger models on limited hardware using quantization and CPU+GPU hybrid inference
  • Run inference on Apple Silicon or non-NVIDIA GPUs via Metal, Vulkan, HIP, or SYCL

How llama.cpp compares

llama.cpp alongside other open-source local runtimes tools AI/TLDR tracks, ranked by GitHub stars.

ToolStarsWhat it does
Ollama★ 175kA developer-friendly tool that downloads and runs local LLMs from the terminal with a built-in OpenAI-compatible API.
llama.cpp★ 117kRun LLMs in C/C++ on CPU, Apple Silicon, and GPU with low memory use
GPT4All★ 77.4kGPT4All is a free desktop app and Python client that runs large language models locally on your own computer, with no API calls or GPU required.
LocalAI★ 47kA self-hosted server that exposes an OpenAI-compatible API for running text, vision, voice, and image models on local hardware.
Jan★ 43.1kAn open-source desktop app that runs LLMs fully offline as a ChatGPT-style assistant on your own computer.
llamafile★ 25kA Mozilla project that packages a model and its runtime into one executable file you can copy and run on any OS.
MLC LLM★ 22.8kA machine-learning compiler that builds and runs LLMs across browsers, phones, and desktops using TVM-based code generation.
KTransformers★ 17.3kA framework for running large Mixture-of-Experts models locally by splitting work between CPU and GPU to fit limited VRAM.