In plain English
When you install Ollama, LM Studio, or Jan and run an AI model on your own laptop, there is a piece of software doing the actual work underneath all those friendly interfaces. That piece of software is almost certainly llama.cpp. It is the open-source C++ library that actually loads a model file, feeds your prompt in, and streams tokens back. Everything else is built on top of it.
Think of llama.cpp as the engine in a car. When you drive a car you interact with the steering wheel, pedals, and dashboard — not with the engine directly. Ollama is the dashboard. LM Studio is a fancier dashboard with more gauges. But when you press the gas pedal, it is the engine that does the work. llama.cpp is the engine.
More precisely, llama.cpp is an inference engine — software that takes a trained model's weights and performs the forward-pass math on your CPU or GPU to produce output text. It was created by Georgi Gerganov and first released on March 10, 2023. Within a month it had 19,000 GitHub stars. Today it has become what its own community calls "the de facto standard" at the core of almost every local AI tool.
Why it matters
Before llama.cpp, running a large language model locally was genuinely hard. You needed PyTorch, CUDA, a specific Python environment, and usually a datacenter-grade GPU. When Meta released Llama in early 2023 and the weights leaked, people desperately wanted to run it on consumer hardware, but nobody had built the tooling yet. By his own account in the project's early README, Gerganov hacked the first version of llama.cpp together in an evening.
The key insight was this: if you quantize the model weights down to 4-bit integers instead of 32-bit floats, the model file shrinks by roughly 8x (roughly 4x against the 16-bit floats most checkpoints ship in). A 7-billion-parameter model that needs 28 GB at 32 bits per weight, or 14 GB at 16 bits, suddenly fits in about 4 GB. And integer math on a CPU is fast enough that a modern laptop can generate readable text at a usable speed, with no GPU required.
That single insight — quantize aggressively, run on CPU — unlocked local AI for hundreds of millions of computers that would never qualify for GPU-accelerated inference. llama.cpp is why local AI is accessible today, not just aspirational.
- No Python, no CUDA, no dependencies. llama.cpp compiles to a single binary. You can run it on Windows, macOS, or Linux without installing a Python environment.
- It runs on nearly any hardware. CPU, NVIDIA GPU via CUDA, AMD GPU via ROCm/HIP, Apple Silicon via Metal, Qualcomm NPUs, and even Android and ChromeOS (since December 2025). If the chip exists, llama.cpp probably supports it.
- It defined the GGUF format. Every model you download from Hugging Face as a .gguf file is designed to be loaded by llama.cpp. GGUF is the universal packaging standard for local models.
- It ships a ready-to-use HTTP server. llama-server exposes an OpenAI-compatible API on localhost, so any app written against the OpenAI API can point at llama.cpp instead.
- The whole ecosystem sits on top of it. Ollama, LM Studio, Jan, and dozens of other tools are effectively polished interfaces around llama.cpp. Understanding the engine helps you understand all of them.
How it works
At its core, llama.cpp does one thing: it loads a GGUF model file into memory and runs the transformer math to predict one token at a time. That sounds simple, but there is a lot of engineering making it fast on hardware that was never built for AI.
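The "one token at a time" loop can be sketched in a few lines. This is a toy illustration, not llama.cpp's code: `generate` and `toy_model` are hypothetical names, and `toy_model` stands in for a full transformer forward pass.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_new_tokens: int,
             eos_token: int) -> List[int]:
    """Greedy autoregressive loop: feed the sequence in, append one token, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # stands in for one forward pass of the model
        tokens.append(tok)
        if tok == eos_token:      # stop as soon as the end-of-sequence token appears
            break
    return tokens


def toy_model(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model: always predicts 'last token + 1'."""
    return tokens[-1] + 1


output = generate(toy_model, prompt=[1, 2, 3], max_new_tokens=10, eos_token=6)
print(output)
```

Every new token requires another pass over the model, which is why per-token speed dominates how responsive a local model feels.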
GGUF: the model file format
GGUF stands for GGML Universal File (GGML is the tensor library under llama.cpp). Introduced in August 2023 to replace the older GGML format, a GGUF file is a single binary that contains everything needed to run a model: the architecture description, tokenizer vocabulary, quantized weights, and all metadata. You download one file and llama.cpp can run it — no separate config files, no tokenizer JSON, nothing else.
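To make the "single binary" idea concrete, here is a minimal sketch that reads only the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts). `read_gguf_header` and `GGUFHeader` are hypothetical helper names, and the demo parses a synthetic header written to a temp file, not a real model.

```python
import os
import struct
import tempfile
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(path: str) -> GGUFHeader:
    """Parse the first 24 bytes of a GGUF file (assumes the v2+ little-endian layout)."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return GGUFHeader(version, tensor_count, kv_count)


# Demo on a synthetic header so the sketch runs without downloading a model.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(struct.pack("<4sIQQ", b"GGUF", 3, 2, 5))
    demo_path = tmp.name

header = read_gguf_header(demo_path)
os.remove(demo_path)
print(header)
```

The metadata key/value pairs and tensor descriptions follow this header in a real file; parsing those is beyond this sketch.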
Quantization: how big models fit on small hardware
Full-precision model weights are stored as 32-bit or 16-bit floats. Quantization reduces each weight to fewer bits — commonly 4 bits — which shrinks the file and the memory footprint dramatically. llama.cpp implements multiple quantization schemes. The most important are the K-quants (Q4_K_M, Q5_K_M, Q6_K, and others), which group weights into blocks with shared scaling factors to preserve quality at low bit widths.
| Quantization | Bits per weight | Approx size (7B model) | Quality vs full precision |
|---|---|---|---|
| Q2_K | ~2.6 bits | ~2.7 GB | Noticeably degraded — avoid for most uses |
| Q4_K_M | ~4.5 bits | ~4.1 GB | Excellent — the most popular sweet spot |
| Q5_K_M | ~5.5 bits | ~5.0 GB | Very close to full — good if you have RAM |
| Q6_K | ~6.6 bits | ~5.9 GB | Near-indistinguishable from full precision |
| Q8_0 | ~8 bits | ~7.7 GB | Essentially lossless — mainly for comparison |
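The fractional "bits per weight" values in the table come from per-block overhead. The sketch below is a deliberately simplified block quantizer, not llama.cpp's actual K-quant layout: it assumes blocks of 32 weights, 4-bit signed integers, and one 16-bit scale per block. All function names are hypothetical.

```python
from typing import List, Tuple

BLOCK_SIZE = 32   # weights per block (illustrative assumption)
WEIGHT_BITS = 4   # bits stored per quantized weight
SCALE_BITS = 16   # one half-precision scale per block (assumption)


def quantize_block(weights: List[float]) -> Tuple[float, List[int]]:
    """Map a block of floats to integers in [-7, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale: float, q: List[int]) -> List[float]:
    """Recover approximate floats by multiplying each integer by the block scale."""
    return [scale * v for v in q]


def bits_per_weight() -> float:
    """Average storage cost per weight, including the shared scale."""
    return (BLOCK_SIZE * WEIGHT_BITS + SCALE_BITS) / BLOCK_SIZE


def estimated_size_gb(n_params: float, bpw: float) -> float:
    """Rough file size in GB (10^9 bytes) for a given parameter count."""
    return n_params * bpw / 8 / 1e9


# Synthetic weights spread across roughly [-0.32, 0.31].
weights = [((i * 37) % 64 - 32) / 100 for i in range(BLOCK_SIZE)]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

bpw = bits_per_weight()
print(f"scale={scale:.4f}  max rounding error={max_error:.4f}")
print(f"bits per weight={bpw}  est. 7B size={estimated_size_gb(7e9, bpw):.2f} GB")
```

The estimate lands slightly below the table's figure for Q4_K_M. Real files differ because some tensors are typically kept at higher precision and "7B" models rarely have exactly 7 billion parameters, so treat this as back-of-the-envelope arithmetic only.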
GPU offloading: the best of both worlds
llama.cpp does not require a GPU, but it can use one. With the --n-gpu-layers flag you tell it how many transformer layers to offload to the GPU. If your GPU VRAM can hold the whole model, everything runs at GPU speed. If VRAM is limited, you offload as many layers as fit and run the rest on CPU — a hybrid mode that is slower than pure-GPU but much faster than pure-CPU. This is one of the features that makes llama.cpp uniquely practical on consumer hardware.
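A rough way to pick a starting value for --n-gpu-layers is to divide the VRAM you can spare by the approximate size of one layer. The helper below is a hypothetical sketch with loud assumptions: it treats all layers as equally sized, ignores the embedding and output tensors, and uses a flat reserve to stand in for the KV cache and other buffers.

```python
def layers_that_fit(model_size_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many transformer layers fit in VRAM (coarse heuristic)."""
    per_layer_gb = model_size_gb / n_layers   # assumes equally sized layers
    usable_gb = max(0.0, vram_gb - reserve_gb)  # keep headroom for KV cache etc.
    return min(n_layers, int(usable_gb // per_layer_gb))


# A ~4.1 GB model with 32 layers on a small 3 GB GPU versus a 24 GB GPU.
small_gpu = layers_that_fit(model_size_gb=4.1, n_layers=32, vram_gb=3.0)
large_gpu = layers_that_fit(model_size_gb=4.1, n_layers=32, vram_gb=24.0)
print(f"3 GB GPU:  --n-gpu-layers {small_gpu}")
print(f"24 GB GPU: --n-gpu-layers {large_gpu}")
```

In practice you would start from a number like this and adjust it up or down while watching actual VRAM usage, since context length changes how much memory the KV cache needs.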
llama.cpp vs Ollama: what is the difference
This is the most common question newcomers ask. The short answer: Ollama runs llama.cpp under the hood. Ollama is a process manager and model-library wrapper built on top of llama.cpp's inference engine. When you run ollama run llama3.2, Ollama downloads a GGUF file, configures llama.cpp appropriately for your hardware, starts the server, and hands you a chat prompt. llama.cpp does the actual token generation.
So why would you ever use llama.cpp directly instead of Ollama? A few reasons:
- Maximum control. llama.cpp exposes dozens of flags that Ollama hides. You can set exact context length, KV cache size, batch size, rope scaling, and many other parameters that Ollama manages automatically.
- Any GGUF file. Ollama has a curated model library. llama.cpp will load any .gguf file you point it at, including Unsloth quantizations and experimental models not yet in Ollama's registry.
- Raw performance. Community benchmarks consistently show llama.cpp producing 2–8% more tokens per second than the equivalent Ollama setup, because Ollama adds a thin management layer.
- Embedded use. If you are building an application that bundles a model, you may link against the llama.cpp C library directly rather than depending on a separate Ollama daemon.
- Learning. Calling llama.cpp directly teaches you what the flags actually mean — context size, temperature, GPU layers — in a way that Ollama's abstractions hide.
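As an illustration of the "maximum control" point, the sketch below assembles a llama-server command line from explicit settings. The helper `llama_server_args` is hypothetical, and the defaults are arbitrary example values; the flags shown are long-form options llama-server accepts at the time of writing, but confirm them against `llama-server --help` for your build.

```python
import shlex
from typing import List


def llama_server_args(model_path: str, ctx_size: int = 4096,
                      n_gpu_layers: int = 0, batch_size: int = 512,
                      port: int = 8080) -> List[str]:
    """Build an argv list for llama-server with explicit inference settings."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),          # context window in tokens
        "--n-gpu-layers", str(n_gpu_layers),  # layers offloaded to the GPU
        "--batch-size", str(batch_size),      # prompt-processing batch size
        "--port", str(port),
    ]


args = llama_server_args("llama-3.2-3b-q4_k_m.gguf", ctx_size=8192, n_gpu_layers=20)
print(shlex.join(args))
# Pass `args` to subprocess.Popen(args) to actually launch the server.
```

Keeping the settings in code like this makes it easy to see exactly which parameters a wrapper such as Ollama would otherwise choose for you.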
| | llama.cpp directly | Ollama (built on llama.cpp) |
|---|---|---|
| Ease of use | Requires CLI flags and manual setup | Single command: ollama run <model> |
| Model management | You download and manage GGUF files yourself | Built-in library, auto-download |
| API | llama-server (OpenAI-compatible) | OpenAI-compatible, same format |
| Model source | Any GGUF file anywhere | Ollama registry + Hugging Face |
| Performance | Marginally faster (2–8%) | Tiny overhead from wrapper |
| Best for | Power users, developers, embedded use | Beginners, quick prototyping, day-to-day use |
Using llama.cpp directly
llama.cpp ships several command-line tools. The two you will use most are llama-cli for interactive chat and llama-server for running a local API. On macOS you can install both with a single Homebrew command. On Windows and Linux, you either download a prebuilt release from the GitHub releases page or build from source.
# Install via Homebrew (includes llama-cli and llama-server)
brew install llama.cpp
# Download a GGUF model from Hugging Face (example: Llama 3.2 3B Q4_K_M)
# Replace the URL with any GGUF file you want to run
curl -L -o llama-3.2-3b-q4_k_m.gguf \
"https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf"
# Chat interactively
llama-cli -m llama-3.2-3b-q4_k_m.gguf -cnv
# Or start the HTTP server (OpenAI-compatible API on port 8080)
llama-server -m llama-3.2-3b-q4_k_m.gguf --port 8080

Once llama-server is running, you can call it exactly like the OpenAI API: from curl, from Python using the openai library, or from any other client. The endpoint is http://localhost:8080/v1/chat/completions.
from openai import OpenAI

# Point the OpenAI client at your local llama-server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # ignored unless llama-server was started with --api-key
)
response = client.chat.completions.create(
    model="local",  # llama-server ignores model name, uses the loaded file
    messages=[
        {"role": "user", "content": "Explain quantization in one sentence."}
    ],
)
print(response.choices[0].message.content)
# Everything runs on YOUR machine; no request leaves localhost.

Going deeper
Once you are comfortable with the basics, here is where llama.cpp's depth becomes relevant.
The GGML tensor library underneath
llama.cpp is built on GGML, a pure-C tensor algebra library also written by Georgi Gerganov. GGML handles the low-level matrix multiplications and memory layouts that make inference fast. It is what allows llama.cpp to target so many different hardware backends — each backend (CUDA, Metal, Vulkan, OpenCL, SYCL) implements the GGML tensor operations in the most efficient way for that chip. When llama.cpp added AMD GPU support via HIP or added SYCL for Intel GPUs, that happened at the GGML layer. The separation of concerns means model code does not change when a new hardware target is added.
Speculative decoding
Since 2024, llama.cpp has supported speculative decoding: a small draft model generates candidate tokens at very low cost, and the main model verifies several at once in a single forward pass. When the draft is right (which happens often for predictable text like code), you get multiple tokens for the cost of one. On Apple Silicon M-series chips with Gemma 4 as the draft model, speculative decoding can deliver 2x or more generation speed on coding tasks.
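The draft-and-verify idea can be shown with toy "models" that map a token sequence to the next token. This is a simplified greedy sketch, not llama.cpp's implementation: all names are hypothetical, and where a real engine scores every drafted position in one batched forward pass, this version calls the target once per position for clarity.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # token sequence -> greedy next token


def speculative_generate(target: Model, draft: Model, prompt: List[int],
                         n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: accept the draft's tokens while they match the target."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position in order.
        for i, guess in enumerate(proposal):
            correct = target(tokens + proposal[:i])
            if guess != correct:
                tokens.extend(proposal[:i])  # keep the matching prefix
                tokens.append(correct)       # target's token replaces the miss
                break
        else:
            tokens.extend(proposal)          # every drafted token was accepted
    return tokens[:goal]


def plain_generate(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 10


def draft(tokens: List[int]) -> int:
    # Agrees with the target except at every fifth position.
    guess = target(tokens)
    return (guess + 1) % 10 if len(tokens) % 5 == 0 else guess


fast = speculative_generate(target, draft, prompt=[1], n_new=12)
slow = plain_generate(target, prompt=[1], n_new=12)
print(fast == slow)
```

The important property is that the output is identical to what the main model would have produced alone; the draft model only changes how many tokens are confirmed per verification step.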
The built-in web UI and llama-server features
As of 2026, llama-server ships with a production-quality SvelteKit web interface that launches automatically in the browser. It supports multimodal input (images, PDFs, audio), structured JSON output via JSON schema constraints, parallel conversation sessions, and Prometheus-compatible metrics at /metrics when you enable --metrics. llama.cpp also added support for the Anthropic Messages API format alongside its existing OpenAI-compatible endpoints, so you can use Claude-API-format clients against a local model.
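A schema-constrained request might look like the sketch below. It builds the payload only and does not contact a server. The `response_format` shape follows the OpenAI Chat Completions convention; whether your llama-server build accepts exactly this shape is an assumption to check against the server documentation. `build_structured_request`, the schema name `answer`, and the example schema are all hypothetical.

```python
import json
from typing import Any, Dict


def build_structured_request(prompt: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Build a chat-completions payload that asks for JSON matching `schema`."""
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        # OpenAI-style structured output request (assumed to be accepted as-is).
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "answer", "schema": schema},
        },
    }


city_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

payload = build_structured_request(
    "Name one large city and its population.", city_schema
)
print(json.dumps(payload, indent=2))
```

You would POST this payload to the /v1/chat/completions endpoint of a running server and parse the returned message content with `json.loads`.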
When to reach beyond llama.cpp
llama.cpp is optimized for single-user, consumer-hardware inference. If your goal shifts to serving dozens or hundreds of concurrent users from a server GPU, you will eventually hit its limits. Server-grade inference engines like vLLM are built around PagedAttention and continuous batching to keep a datacenter GPU saturated with concurrent requests. llama-server can batch parallel requests as well, but sustained high-concurrency throughput is not what it is tuned for. For personal use, offline use, or developer prototyping, llama.cpp is the right tool. For production serving at scale, the throughput-focused engines take over.
The quantization frontier keeps moving. New quantization schemes like IQ-quants (importance-aware quantization) and Unsloth's custom GGUF quantizations regularly push the quality/size frontier. When evaluating a model, look past the Q4_K_M default: newer schemes at the same bit width can offer meaningfully better perplexity. The llama-quantize tool ships with llama.cpp if you want to quantize a model yourself. A Hugging Face safetensors checkpoint first has to be converted to GGUF, which the repository's convert_hf_to_gguf.py script handles.
FAQ
What does llama.cpp actually do?
llama.cpp is an inference engine — it loads a GGUF model file into memory and runs the transformer math to generate text tokens one at a time. It is the software layer that turns a downloaded model file into something that can actually read your prompt and produce a response. Virtually every local AI app (Ollama, LM Studio, Jan) uses llama.cpp under the hood to do this work.
Do I need to know C++ to use llama.cpp?
No. llama.cpp ships prebuilt binaries (llama-cli and llama-server) that you use from the command line. You only need C++ knowledge if you want to compile it from source or embed it as a library in your own application. For most users, downloading a release binary and pointing it at a GGUF file is all that is required.
What is the difference between llama.cpp and Ollama?
Ollama is built on top of llama.cpp and uses it as its inference engine. Ollama adds a model library, a simple command-line interface, and a background service — it makes llama.cpp much easier to use. If you use Ollama, you are already using llama.cpp. You would use llama.cpp directly when you need fine-grained control over inference parameters, want to load any GGUF file regardless of the Ollama library, or need to embed inference in a custom application.
What is a GGUF file and where do I get one?
GGUF is the model file format designed for llama.cpp. It packages the model weights, architecture metadata, and tokenizer vocabulary into a single binary file. You can download GGUF files from Hugging Face — search for any model name plus "GGUF" and you will find community-quantized versions. The Q4_K_M quantization variant is a good default for most hardware.
Can llama.cpp use my GPU?
Yes. llama.cpp supports NVIDIA GPUs via CUDA, AMD GPUs via ROCm/HIP, Apple Silicon via Metal, Intel GPUs via SYCL, and a Vulkan backend that works on most modern GPUs. You can also do partial GPU offloading with the --n-gpu-layers flag, loading as many layers as fit in VRAM and running the rest on CPU — useful when the model is too large to fit entirely in GPU memory.
Is llama.cpp only for Llama models?
Despite the name, llama.cpp supports a very wide range of model architectures including Mistral, Gemma, Phi, Qwen, DeepSeek, Falcon, and many others — essentially any model that has been converted to GGUF format. The name comes from the original Meta Llama model it was built for, but the project grew into a general-purpose local inference engine.