AI/TLDR

What Is the GGUF Format?

You will understand what a GGUF file actually contains, how llama.cpp reads it, and how to decode cryptic names like Q4_K_M.

Intermediate · 11 min read · Updated 2026-06-12

In plain English

When you download a local AI model from Hugging Face or Ollama, you will almost always get a file ending in .gguf. That file contains everything the model needs to run — billions of compressed weight numbers, the vocabulary the model uses to read and write text, and dozens of configuration values — all packed into one self-contained binary file. GGUF is the name of that packaging standard.

Think of a GGUF file the way you think of a .docx file for Word documents. A .docx is not just text — it bundles the text, fonts, styles, and metadata in one ZIP-based container that Word knows how to open. GGUF does the same job for quantized language models: it bundles compressed weights, tokenizer data, and architecture metadata in one file that llama.cpp — and every tool built on top of it, like Ollama and LM Studio — knows exactly how to read.

GGUF replaced an older format called GGML in August 2023. GGML files were fragile: if the llama.cpp code changed, old files broke and there was no clean way to tell which version a file needed. GGUF fixed that by embedding a version number and rich metadata directly inside the file, so the tool can check compatibility before trying to load anything.

Why it matters

Before GGUF, running a local model was a jigsaw puzzle. The weights lived in one file, the tokenizer in another, the architecture config in a third, and a fourth file explained which version of the code they needed. Updating llama.cpp could silently break every model you had. Sharing a model meant sharing a folder of interdependent files and hoping nothing was missing.

GGUF collapsed all of that into a single, versioned, self-describing file. The immediate practical consequences:

  • One file to move. Copy a .gguf file and the model is fully portable — no companion config files needed.
  • Instant compatibility check. The file's header contains a version number and architecture name; llama.cpp reads those first and refuses gracefully if the format is too old or too new.
  • Fast memory-mapping. Tensor data is aligned (to 32 bytes by default) and laid out so the OS can mmap it directly: the file is mapped into virtual memory without a full read-into-RAM step, letting large models start quickly.
  • Ecosystem convergence. Every major local inference tool — Ollama, LM Studio, Jan, kobold.cpp, text-generation-webui — standardized on GGUF. Community quantizers publish a single format that works everywhere.

The ecosystem effect is the biggest payoff. Because everyone agreed on one format, a model quantized by one person runs in a dozen different tools without conversion. Hugging Face even renders a special GGUF browser that shows the embedded metadata before you download the file.

How the file is laid out

A GGUF file is a binary file divided into four sequential sections. They always appear in the same order, which is what lets llama.cpp seek to the tensor data with a single offset calculation rather than scanning the whole file.

The header

The file opens with four ASCII bytes: G, G, U, F. Any program can identify a GGUF file instantly by checking those four bytes (called the magic number). Immediately after come a 32-bit format version (currently 3), a 64-bit count of tensors, and a 64-bit count of metadata key-value pairs. The header is intentionally tiny so tools can read it in a single small read call.
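As a rough illustration of how small that header is, here is a minimal sketch that decodes it with Python's struct module. It assumes little-endian byte order, and the function name parse_gguf_header and the sample counts are made up for the demo rather than taken from any real file.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Assumed little-endian layout: 4-byte magic, uint32 version,
# uint64 tensor count, uint64 metadata key-value count.
HEADER_STRUCT = struct.Struct("<4sIQQ")


def parse_gguf_header(data: bytes) -> dict:
    """Decode the fixed-size GGUF header from the first bytes of a file."""
    if len(data) < HEADER_STRUCT.size:
        raise ValueError("not enough bytes for a GGUF header")
    magic, version, tensor_count, kv_count = HEADER_STRUCT.unpack_from(data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"bad magic {magic!r}; not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration only (hypothetical counts).
sample = HEADER_STRUCT.pack(GGUF_MAGIC, 3, 291, 26)
header = parse_gguf_header(sample)
print(header)
```

To try it on a real file, read the first HEADER_STRUCT.size bytes in binary mode and pass them to the same function.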

The metadata block

After the header, the file stores a flat list of typed key-value pairs. Keys are namespaced strings like llama.context_length, tokenizer.ggml.model, or general.architecture. Values can be integers, floats, booleans, strings, or arrays. A 7B Llama model typically carries 25–40 metadata keys. This section is what makes the file self-describing — a tool can read architecture and tokenizer details without any external config file.

text
general.architecture        = "llama"
general.name                = "Meta-Llama-3-8B-Instruct"
llama.context_length        = 8192
llama.embedding_length      = 4096
llama.block_count           = 32
llama.attention.head_count  = 32
llama.rope.freq_base        = 500000.0
tokenizer.ggml.model        = "gpt2"
tokenizer.ggml.tokens       = ["!", "\"", "#", ...]
general.quantization_version = 2
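The typed key-value encoding can be sketched in a few lines. This is an illustrative reader, not llama.cpp's own code: it assumes little-endian data, length-prefixed UTF-8 strings, and the value type IDs from the GGUF specification. The helper encode_string and the sample bytes exist only to give the reader something to parse.

```python
import struct

# GGUF value type IDs mapped to struct format characters (little-endian assumed).
SCALAR_FORMATS = {
    0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
    6: "f", 7: "?", 10: "Q", 11: "q", 12: "d",
}
TYPE_STRING, TYPE_ARRAY = 8, 9


def read_string(buf: bytes, pos: int):
    """Strings are a uint64 byte length followed by UTF-8 bytes."""
    (length,) = struct.unpack_from("<Q", buf, pos)
    start = pos + 8
    return buf[start:start + length].decode("utf-8"), start + length


def read_value(buf: bytes, pos: int, type_id: int):
    """Return (value, next_position) for one typed value."""
    if type_id == TYPE_STRING:
        return read_string(buf, pos)
    if type_id == TYPE_ARRAY:
        # Arrays store an element type (uint32) and a count (uint64) first.
        elem_type, count = struct.unpack_from("<IQ", buf, pos)
        pos += 12
        items = []
        for _ in range(count):
            item, pos = read_value(buf, pos, elem_type)
            items.append(item)
        return items, pos
    fmt = "<" + SCALAR_FORMATS[type_id]
    (value,) = struct.unpack_from(fmt, buf, pos)
    return value, pos + struct.calcsize(fmt)


def read_kv(buf: bytes, pos: int):
    """One pair is: key string, uint32 value type, then the value itself."""
    key, pos = read_string(buf, pos)
    (type_id,) = struct.unpack_from("<I", buf, pos)
    value, pos = read_value(buf, pos + 4, type_id)
    return key, value, pos


def encode_string(text: str) -> bytes:
    """Demo-only helper that writes a string the way read_string expects."""
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


# Three synthetic pairs: a uint32, a string, and an array of strings.
blob = (
    encode_string("llama.context_length") + struct.pack("<II", 4, 8192)
    + encode_string("general.architecture") + struct.pack("<I", 8) + encode_string("llama")
    + encode_string("tokenizer.ggml.tokens") + struct.pack("<IIQ", 9, 8, 2)
    + encode_string("!") + encode_string("#")
)

pairs, pos = [], 0
while pos < len(blob):
    key, value, pos = read_kv(blob, pos)
    pairs.append((key, value))
print(pairs)
```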

The tensor info array

Next is a compact index of every tensor in the model. For each tensor the file records: its name (e.g. blk.0.attn_q.weight), the number of dimensions and their sizes, the data type (which encodes the quantization level), and a 64-bit byte offset pointing to where in the tensor data section the actual bytes start. This index lets llama.cpp build a complete map of the model in memory before loading a single weight.
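One entry of that index can be decoded as sketched below. The field order follows the description above (name, dimension count, dimension sizes, type, offset), with little-endian integers assumed. TensorInfo, read_tensor_info, and encode_tensor_info are illustrative names, and the sample offset is hypothetical.

```python
import struct
from dataclasses import dataclass


@dataclass
class TensorInfo:
    name: str
    shape: tuple
    ggml_type: int   # numeric type ID; it encodes the quantization format
    offset: int      # byte offset relative to the start of the tensor data section


def read_tensor_info(buf: bytes, pos: int):
    """Decode one tensor-info entry and return (TensorInfo, next_position)."""
    (name_len,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    name = buf[pos:pos + name_len].decode("utf-8")
    pos += name_len
    (n_dims,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    shape = struct.unpack_from(f"<{n_dims}Q", buf, pos)
    pos += 8 * n_dims
    ggml_type, offset = struct.unpack_from("<IQ", buf, pos)
    return TensorInfo(name, shape, ggml_type, offset), pos + 12


def encode_tensor_info(name: str, shape: tuple, ggml_type: int, offset: int) -> bytes:
    """Demo-only helper that writes an entry in the layout read above."""
    raw = name.encode("utf-8")
    return (
        struct.pack("<Q", len(raw)) + raw
        + struct.pack("<I", len(shape))
        + struct.pack(f"<{len(shape)}Q", *shape)
        + struct.pack("<IQ", ggml_type, offset)
    )


# Two synthetic entries. Type 12 is Q4_K and type 0 is F32 in ggml's type enum;
# the second offset is a made-up value for illustration.
index_bytes = (
    encode_tensor_info("blk.0.attn_q.weight", (4096, 4096), 12, 0)
    + encode_tensor_info("blk.0.attn_norm.weight", (4096,), 0, 9437184)
)

tensors, pos = [], 0
while pos < len(index_bytes):
    info, pos = read_tensor_info(index_bytes, pos)
    tensors.append(info)

for t in tensors:
    print(t.name, t.shape, t.ggml_type, t.offset)
```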

The tensor data section

The final and by far the largest section is the raw weight bytes. Tensors are packed back-to-back, each padded to an alignment boundary (32 bytes by default, overridable with the general.alignment metadata key). Because the bytes on disk are already in the layout the compute kernels expect, llama.cpp can mmap the file and use the tensors in place instead of copying them into separate buffers. On macOS with Apple Silicon, this means a 4-bit 8B model can start answering in under two seconds even though the file is over 4 GB: the OS pages data in on demand rather than loading everything upfront.
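The padding rule is simple arithmetic. A minimal sketch follows, with the alignment passed in as a parameter since a file can declare its own value through the general.alignment metadata key. The function names and the tensor sizes are illustrative.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (no change if already aligned)."""
    return offset + (alignment - offset % alignment) % alignment


def layout_offsets(sizes, alignment):
    """Pack tensors back-to-back, padding each start to the alignment boundary."""
    offsets, cursor = [], 0
    for size in sizes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets


# Three hypothetical tensors of 100, 40, and 64 bytes with 32-byte alignment.
print(layout_offsets([100, 40, 64], 32))  # [0, 128, 192]
```

Each stored offset is a multiple of the alignment, so a reader can add it to the start of the data section and use the bytes in place.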

Decoding Q4_K_M, Q8_0, and the rest

When you browse Hugging Face for a GGUF model you will see filenames like Meta-Llama-3-8B-Instruct-Q4_K_M.gguf or mistral-7b-Q8_0.gguf. The suffix after the model name encodes the quantization level baked into that file. The naming scheme follows a consistent grammar once you know the key:

| Part | Meaning | Example |
|---|---|---|
| Q | Quantized (the suffixes covered here start with it) | Q4, Q8 |
| Bit number | Nominal bits per weight: lower = smaller + faster + more loss | 4 → ~0.5 bytes/weight; 8 → ~1 byte/weight |
| _0 | Original (legacy) block-wise quantization, one scale per block | Q4_0, Q8_0 |
| _K | K-quant: weights grouped into super-blocks whose sub-block scales are themselves quantized, giving better quality at the same size | Q4_K, Q5_K, Q6_K |
| _S / _M / _L | Small / Medium / Large mix of a K-quant: larger mixes keep more of the sensitive tensors at higher precision, trading a little size for better quality | Q4_K_S, Q4_K_M, Q3_K_L |
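The grammar in the table can be turned into a small filename decoder. This is a sketch that covers only the parts listed above (Q, a bit number, _0 or _K, and an optional _S/_M/_L). The function name decode_quant_suffix and the returned field names are invented for illustration, and real repositories use other suffixes that this pattern deliberately ignores.

```python
import re

# Matches suffixes such as -Q4_K_M.gguf or -Q8_0.gguf at the end of a filename.
QUANT_RE = re.compile(r"[-.](Q(\d)_(0|K)(?:_(S|M|L))?)\.gguf$", re.IGNORECASE)

VARIANTS = {"S": "small", "M": "medium", "L": "large"}


def decode_quant_suffix(filename: str):
    """Return a dict describing the quantization suffix, or None if absent."""
    match = QUANT_RE.search(filename)
    if match is None:
        return None
    label, bits, family, size = match.groups()
    return {
        "label": label.upper(),
        "bits": int(bits),
        "scheme": "k-quant" if family.upper() == "K" else "legacy block-wise",
        "variant": VARIANTS.get((size or "").upper()),
    }


for name in ("Meta-Llama-3-8B-Instruct-Q4_K_M.gguf", "mistral-7b-Q8_0.gguf"):
    print(name, "->", decode_quant_suffix(name))
```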

K-quants (introduced in 2023) are the key improvement over the legacy _0 formats. They change two things. First, weights are grouped into larger super-blocks whose per-sub-block scales are themselves quantized, which spends the bit budget more efficiently. Second, the _S/_M/_L mixes do not quantize every tensor at the same bit-width: tensors that are more sensitive to rounding (typically some of the attention and feed-forward projection weights) are kept at a slightly higher precision, while less sensitive tensors are compressed harder. The result is a file that is almost as small as a naive 4-bit model but measurably more accurate.

The practical upshot: Q4_K_M is the default choice for almost everyone. Published perplexity comparisons of Q4_K_M against Q8_0 put the gap at around 0.05 points on a 7B model, below the threshold a typical user notices in conversation. The memory saving, however, is roughly 47%: a 7B model drops from ~7.7 GB to ~4.1 GB. For larger models the trade-off is even more compelling because you may simply not have the VRAM to run Q8_0 at all.
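A back-of-the-envelope calculation shows where savings of this order come from. The sketch below uses the legacy block formats because their layout is simple: 32 weights per block plus one 2-byte scale. It ignores metadata, the tokenizer, and the mixed precision of K-quant files, so the results are estimates rather than real file sizes. The 7-billion parameter count is a hypothetical round number.

```python
def block_bits_per_weight(weights_per_block: int, bits_per_weight: int, scale_bytes: int) -> float:
    """Effective bits per weight once the per-block scale is counted."""
    total_bits = weights_per_block * bits_per_weight + scale_bytes * 8
    return total_bits / weights_per_block


def estimated_size_gb(n_params: float, bpw: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bpw / 8 / 1e9


q4_0 = block_bits_per_weight(32, 4, 2)   # 4.5 bits per weight
q8_0 = block_bits_per_weight(32, 8, 2)   # 8.5 bits per weight

n_params = 7e9  # hypothetical 7B-parameter model
print(f"Q8_0: {estimated_size_gb(n_params, q8_0):.2f} GB")
print(f"Q4_0: {estimated_size_gb(n_params, q4_0):.2f} GB")
print(f"saving: {(1 - q4_0 / q8_0) * 100:.1f}%")
```

The estimate lands in the same range as the figures quoted above; real Q4_K_M files come out slightly larger than the Q4_0 estimate because some tensors are stored at higher precision.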

Using GGUF in practice

Every major local inference tool speaks GGUF natively. The right choice depends on whether you want a graphical interface, API access, or direct programmatic control.

Ollama — simplest path

Ollama wraps llama.cpp and manages GGUF files for you. Models are pulled from a registry and run with a single command. You can also load any .gguf file directly by writing a one-line Modelfile.

bash
# Pull and run a model (the default tag is typically a 4-bit quantization such as Q4_K_M)
ollama run llama3.1:8b

# Load a specific GGUF file from disk
cat > Modelfile <<'EOF'
FROM ./Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
EOF
ollama create my-llama -f Modelfile
ollama run my-llama

llama.cpp directly — full control

If you want low-level access (custom sampling, specific thread counts, mixed CPU/GPU inference), you can call llama.cpp's llama-cli binary directly. The --n-gpu-layers flag sets how many of the model's layers are offloaded to the GPU, with the remaining layers staying on the CPU, which is useful when your model is slightly too large to fit entirely in VRAM.

bash
# Build llama.cpp (one-time)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON   # or -DGGML_METAL=ON for Mac
cmake --build build --config Release -j

# Run inference — offload 28 of 32 layers to GPU, keep 4 on CPU
./build/bin/llama-cli \
  --model Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 28 \
  --ctx-size 4096 \
  --prompt "Explain the GGUF format in one paragraph."

Python via llama-cpp-python

python
from llama_cpp import Llama

# n_gpu_layers=-1 means offload everything to GPU
llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
)

response = llm(
    "What is the GGUF format?",
    max_tokens=200,
    echo=False,
)
print(response["choices"][0]["text"])

Going deeper

The one-file design and fast mmap loading are the user-facing wins. Once you're comfortable with that, here are the details that matter for more advanced use.

GGUF split files handle models that exceed the practical single-file size limit on some filesystems and hosts. A 70B model at Q4_K_M is around 41 GB, far above the 4 GB FAT32 limit and awkward to move on some systems. llama.cpp supports splitting a model into a numbered sequence of .gguf shards (e.g. model-00001-of-00004.gguf) that are loaded together at runtime when you point it at the first shard. Each shard is itself a GGUF file with an index of its own tensors; the model-wide metadata lives in the first shard.
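The shard naming pattern shown above (a five-digit index and a five-digit total) is easy to generate or validate. A small sketch, where shard_names and the prefix are illustrative:

```python
def shard_names(prefix: str, count: int) -> list:
    """Build shard filenames in the <prefix>-NNNNN-of-NNNNN.gguf pattern."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return [f"{prefix}-{i:05d}-of-{count:05d}.gguf" for i in range(1, count + 1)]


for name in shard_names("model", 4):
    print(name)
```

A check like this is handy before loading, since a missing shard otherwise only shows up as a load-time error.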

Importance-matrix (imatrix) quantization is a newer technique that produces better Q4 and Q3 files without making them larger. The quantizer runs a small calibration dataset through the model and records which weight channels have the largest activation magnitudes; those channels are quantized more carefully. The result is a file whose filename sometimes includes -imat and that typically scores a lower (better) perplexity than a standard K-quant at the same bit-width, with the largest gains at the lowest bit-widths. When available, prefer imat variants.

Converting your own model to GGUF requires the convert_hf_to_gguf.py script in the llama.cpp repository, which reads a Hugging Face safetensors model and writes a full-precision GGUF. You then run llama-quantize on that file to produce whatever Q-level you want. The two-step process (convert then quantize) keeps the conversion script simple and lets you re-quantize to different levels without re-converting.
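The two-step flow can be scripted. The sketch below only builds the two command lines and does not run them. The model directory, output prefix, and binary path are placeholders, and it assumes the script's --outfile and --outtype options and the positional input, output, and type arguments of llama-quantize.

```python
import shlex


def conversion_commands(hf_dir: str, out_prefix: str, quant: str = "Q4_K_M"):
    """Return (convert_cmd, quantize_cmd) as argument lists; nothing is executed."""
    full_precision = f"{out_prefix}-f16.gguf"
    quantized = f"{out_prefix}-{quant}.gguf"
    convert_cmd = [
        "python", "convert_hf_to_gguf.py", hf_dir,
        "--outfile", full_precision,
        "--outtype", "f16",
    ]
    # Path assumes a CMake build directory; adjust to your setup.
    quantize_cmd = ["./build/bin/llama-quantize", full_precision, quantized, quant]
    return convert_cmd, quantize_cmd


# Hypothetical paths for illustration.
convert_cmd, quantize_cmd = conversion_commands("./my-hf-model", "my-model")
print(shlex.join(convert_cmd))
print(shlex.join(quantize_cmd))
```

Because the full-precision file is kept, producing a second quantization level means repeating only the second command with a different type.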

GGUF metadata is extensible. The key-value store has no fixed schema: any tool can add its own namespaced keys. This is how a file can carry extras such as its chat template (the tokenizer.chat_template key) alongside the standard fields, and LoRA adapters can likewise be packaged as GGUF files of their own, with the adapter weights stored as tensors, making them just as self-describing as a base model. Hugging Face's GGUF parser exposes all metadata before you download, so you can inspect the architecture and quantization type of any file in the repository browser.

The GGML library underneath is the tensor computation engine that llama.cpp runs on. GGUF takes its name from it — the letters stand for GGML Universal Format. GGML handles the backend-specific kernels (CPU via BLAS, CUDA for NVIDIA, Metal for Apple Silicon, Vulkan for cross-platform GPU). When you pick a build flag like -DGGML_CUDA=ON, you're telling GGML which backend to compile. GGUF files are backend-agnostic — the same file runs on any hardware GGML supports.

FAQ

What does GGUF stand for?

GGUF stands for GGML Universal Format (sometimes described as GPT-Generated Unified Format in older sources). It was created by the llama.cpp project in August 2023 to replace the fragile GGML format with a self-describing, versioned binary container for quantized models.

What is the difference between GGUF and GGML?

GGML was the older family of file formats used by llama.cpp before August 2023. Its earliest variant had no version number embedded in the file, and none of its variants carried extensible metadata, so code changes could silently break existing model files with no clear error. GGUF replaced it with a proper header (magic bytes + version), a rich metadata block, and a clean tensor index, making files backward-compatible and self-describing. All modern tools use GGUF; GGML files are obsolete.

Is Q4_K_M good enough for coding tasks?

Yes, for most coding use cases. Reported comparisons generally show Q4_K_M matching or nearly matching Q8_0 on HumanEval and similar coding benchmarks: the 4-bit rounding error is spread across billions of weights and rarely collapses into a wrong function call or logic error. If you are working on highly precise numerical code or very long multi-file context, stepping up to Q5_K_M or Q6_K is a reasonable hedge.

Why do GGUF files load so fast compared to other formats?

GGUF's tensor data section is aligned (to 32 bytes by default) and stored in the same layout used at inference time, which enables memory-mapping (mmap). Instead of reading the entire file into RAM, the OS maps the file into virtual address space and loads pages on demand. The first token may arrive after only a fraction of the file has been physically read. Formats that require deserialization (like some PyTorch formats) cannot do this and must load the whole model before inference starts.

Can I run a GGUF model on CPU only, with no GPU?

Yes, and this is one of GGUF's main use cases. llama.cpp was designed from the start to run on CPU. Performance is much slower than GPU — expect 2–8 tokens per second for a 7B model on a modern laptop CPU — but it works without any graphics hardware. Apple Silicon Macs are a special case: they use unified memory, so the GPU and CPU share the same RAM pool, and GGUF models run on the GPU by default via the Metal backend.

What tools can open GGUF files?

llama.cpp is the reference implementation. Ollama, LM Studio, Jan, kobold.cpp, and text-generation-webui build on llama.cpp (or a fork of it) and open GGUF natively, while Open WebUI is a front end that reaches GGUF models through a backend such as Ollama. The llama-cpp-python library exposes GGUF loading from Python. Hugging Face Hub renders GGUF metadata in the browser so you can inspect a file before downloading it.

Further reading