AI/TLDR

Which GGUF File Should You Download? Reading a Hugging Face Quant Repo

You will walk away able to open any GGUF repo and confidently grab the single right file for your machine instead of guessing.

Beginner · 11 min read · Updated 2026-06-13

In plain English

You found a model you want to run on your own computer. You open its GGUF repository on Hugging Face and instead of one download button you get a wall of files: model-Q4_K_M.gguf, model-Q5_K_S.gguf, model-Q8_0.gguf, model-IQ3_XS.gguf, and maybe a few named ...00001-of-00003.gguf. Every one is the same model. They differ only in how aggressively the weights were squeezed down — the quant level.

[Illustration: Picking a GGUF to Download]

Think of it like buying the same movie in different qualities: 4K, 1080p, 720p, and a tiny phone version. The 4K file is gorgeous but huge and your old laptop chokes on it. The phone version plays anywhere but looks soft. You are not picking a different movie — you are picking the largest quality that still fits and plays smoothly on the screen you actually own. A GGUF repo is exactly that shelf, and your "screen" is your memory budget.

So the whole task reduces to one question: how much memory do I have, and what is the biggest, best-quality file that fits inside it with a little room to spare? Once you can read the file names, the answer is usually obvious in about ten seconds.

Why it matters

Picking the wrong file is one of the most common ways a first local-LLM attempt goes wrong. It fails in two opposite directions, and there is a third trap that has nothing to do with size.

  • Too big → it crashes or crawls. If the file plus its working memory don't fit in your VRAM (GPU memory) or RAM, the model either refuses to load with an out-of-memory error, or it spills onto disk/CPU and generates at a painful one or two words per second. People conclude "local models are useless" when they simply grabbed the 4K file for a phone.
  • Too small → you throw away quality for no reason. If you have 24 GB of VRAM and download a tiny 3-bit quant "to be safe," you get a noticeably dumber model — more mistakes, worse instruction-following — while gigabytes of your hardware sit idle. The waste is invisible, so people never realize they left quality on the table.
  • Wrong repo entirely. Grabbing a base model when you wanted a chat model, or a sketchy re-upload with a broken chat template, produces a model that rambles or ignores your instructions even though the quant was fine.

Getting this right matters because the file you download decides three things at once: whether the model loads at all, how fast it answers, and how smart it feels. There is no "reload to fix it" — you have to redownload several gigabytes. Learning to read the repo once saves you from that loop forever, and it's the gateway skill for every local tool: Ollama, LM Studio, and raw llama.cpp all consume these same files.

How to read a GGUF file name

A GGUF file name encodes everything you need in a short code at the end. Once you can decode it, the shelf stops being scary. Take Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf apart:

Piece | What it tells you
Llama-3.1 | The model family and version.
8B | Parameter count: 8 billion. This roughly sets the file size.
Instruct | It was tuned to follow chat instructions (vs a raw base model). For chatting, you want this.
Q4 | Quantization bits: about 4 bits per weight. Lower = smaller + faster + slightly dumber.
_K | Uses the modern K-quant scheme (smarter than the old legacy quants). Almost everything today is _K.
_M | Size tier within that level: S = small, M = medium, L = large. M is the usual sweet spot.

So Q4_K_M reads as: 4-bit, K-quant, medium tier. That combination is the community default — the best balance of size and quality for most people, most of the time. When in doubt, this is the file to grab.
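If you like to see the decoding spelled out, here is a minimal Python sketch. The helper name `parse_gguf_name` and the regular expression are illustrative assumptions, not part of any tool, and they only cover the common naming pattern described above. Uploaders are free to name files differently.

```python
import re

# Hypothetical pattern for names like "<model>-<quant>[-00001-of-00003].gguf".
# Some uploaders use "." instead of "-" before the quant, so both are accepted.
GGUF_NAME = re.compile(
    r"^(?P<model>.+?)[-.](?P<quant>I?Q\d+(?:_[A-Z0-9]+)*)"
    r"(?:-(?P<part>\d{5})-of-(?P<total>\d{5}))?\.gguf$"
)

TIERS = {"XXS", "XS", "S", "M", "L"}


def parse_gguf_name(filename):
    """Split a conventionally named GGUF file into its readable pieces."""
    m = GGUF_NAME.match(filename)
    if m is None:
        return None  # name does not follow the usual convention
    quant = m.group("quant")
    last = quant.split("_")[-1]
    return {
        "model": m.group("model"),
        "quant": quant,
        "bits": int(re.match(r"I?Q(\d+)", quant).group(1)),
        "k_quant": "_K" in quant,
        "tier": last if last in TIERS else None,
        "split": m.group("part") is not None,
    }


info = parse_gguf_name("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf")
print(info["quant"], info["bits"], info["k_quant"], info["tier"])
```

For the example file this reports a 4-bit K-quant in the medium tier, which is the same reading as the table.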

The decision itself is a short funnel. You start from the memory you have, subtract a little overhead, and that tells you the biggest file you can afford. Then you pick the highest-quality quant whose file size lands under that number.

The key rule for fit: the file should be comfortably smaller than your memory, not equal to it. Running a model needs extra room beyond the weights: the KV cache, which holds the context (your conversation so far), plus working buffers for computation. A good habit is to leave roughly 1–2 GB of headroom on top of the file size. A 7 GB file will not run happily on a GPU with exactly 8 GB of VRAM.
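The headroom rule is simple enough to write as a one-line check. This is only a sketch: the function name `fits` is made up, and the 1.5 GB default is an assumed midpoint of the 1–2 GB range, not a measured value.

```python
def fits(file_gb, memory_gb, headroom_gb=1.5):
    """True when the file plus headroom stays inside the memory budget."""
    return file_gb + headroom_gb <= memory_gb


print(fits(7.0, 8.0))  # False: a 7 GB file on an 8 GB card leaves too little room
print(fits(5.0, 8.0))  # True: a 5 GB file leaves 3 GB spare
```

Raise `headroom_gb` if you plan to use long conversations, since the context needs memory of its own.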

Map your memory to a quant

Here is the part everyone actually wants: a lookup table. "Memory" means your GPU VRAM if you run on a graphics card, or your system RAM if you run on CPU or an Apple Silicon Mac (where RAM is shared with the GPU). Find your budget, find a model size that fits, and read off the quant. Sizes below are approximate for a typical _K_M quant and shift a little per model.

Your memory | Comfortable model + quant | Approx. file size
6–8 GB | 7–8B at Q4_K_M | ~5 GB
10–12 GB | 8B at Q6_K, or 13–14B at Q4_K_M | ~7–9 GB
16 GB | 13–14B at Q5_K_M | ~10 GB
24 GB | 32B at Q4_K_M, or 8B at Q8_0 | ~18 GB
48 GB | 70B at Q4_K_S/M (tight) | ~38–40 GB
64 GB+ RAM (CPU/Mac) | 70B at Q4_K_M comfortably | ~40 GB

How to read this in practice. Suppose you have an 8 GB graphics card. You scan the repo, you know you want an Instruct model, and you reach for the largest file that stays under ~6–6.5 GB (leaving headroom). For an 8B model that is almost always Q4_K_M. You download exactly one file and you are done.
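The same scan can be sketched in Python. Everything here is illustrative: `pick_file` is a hypothetical helper, the listing is a made-up repo with sizes roughly like those of an 8B model, and "largest file that fits" is used as a stand-in for "highest quality that fits".

```python
def pick_file(files, memory_gb, headroom_gb=1.5):
    """Return the largest file (by size in GB) that fits under the budget."""
    budget = memory_gb - headroom_gb
    candidates = [(size, name) for name, size in files.items() if size <= budget]
    if not candidates:
        return None  # nothing fits; look at a smaller model instead
    return max(candidates)[1]


# Hypothetical repo listing: file name -> approximate size in GB.
repo = {
    "model-IQ3_XS.gguf": 3.5,
    "model-Q4_K_M.gguf": 4.9,
    "model-Q6_K.gguf": 6.6,
    "model-Q8_0.gguf": 8.5,
}

print(pick_file(repo, memory_gb=8.0))  # model-Q4_K_M.gguf
```

With an 8 GB budget and 1.5 GB of headroom, the 6.6 GB file just misses the 6.5 GB cutoff, so the Q4_K_M file is the pick.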

Now the ranking within your budget: which quant beats which. From best quality to smallest file, the practical ladder runs Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, then the 2-bit quants. Climb as high as your memory allows; the gains above Q5 are small, and the losses below Q3 are real.

Split GGUF files (the ...00001-of-00003 parts)

For big models you will see names like model-Q4_K_M-00001-of-00003.gguf, ...-00002-of-00003.gguf, ...-00003-of-00003.gguf. These are not alternatives to choose between. They are one file that was cut into pieces because Hugging Face caps single files at 50 GB. A 70B model at the higher quant levels can exceed that, so the uploader splits it.

The rule is simple: download every part of the set, and put them all in the same folder. Don't mix parts from different quants. If you choose Q4_K_M, grab all the Q4_K_M-0000X-of-0000N files and nothing else.

  • llama.cpp and recent versions of LM Studio: just point the loader at the first part (...00001-of-0000N.gguf). The loader detects the rest in the same folder and stitches them automatically, so you do not need to merge anything by hand. Ollama's support for importing split files has varied by version; if an import fails, merge the parts first.
  • If a tool insists on a single file: llama.cpp ships a small utility, llama-gguf-split, that merges the parts back into one file. But for everyday use you rarely need it — modern loaders read the parts directly.
  • Pulling from Ollama's own library? You never see split parts at all; ollama pull handles the whole download and assembly for you. Splits only show up when you download raw files from a Hugging Face repo.
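Before loading a split model, it is worth confirming that every part of the set actually made it into the folder. Here is a minimal sketch of that check. The helper `missing_parts` is hypothetical, and it assumes the `-0000X-of-0000N.gguf` naming shown above.

```python
import re

# Matches the split-file suffix, e.g. "model-Q4_K_M-00001-of-00003.gguf".
PART = re.compile(r"^(?P<stem>.+)-(?P<idx>\d{5})-of-(?P<total>\d{5})\.gguf$")


def missing_parts(filenames, quant):
    """Return the part numbers still missing for one quant's split set."""
    found, total = set(), None
    for name in filenames:
        m = PART.match(name)
        if m and m.group("stem").endswith(quant):
            found.add(int(m.group("idx")))
            total = int(m.group("total"))
    if total is None:
        return None  # no split parts for this quant in the folder
    return sorted(set(range(1, total + 1)) - found)


folder = [
    "model-Q4_K_M-00001-of-00003.gguf",
    "model-Q4_K_M-00003-of-00003.gguf",
    "model-Q8_0-00001-of-00002.gguf",  # stray part from another quant
]
print(missing_parts(folder, "Q4_K_M"))  # [2]
```

An empty list means the set is complete. The stray Q8_0 part is ignored, which mirrors the rule about not mixing parts from different quants.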

Pick the right repo and uploader

Two repos can offer the "same" model and give very different results. Before you even look at quant names, make sure you are in the right repository.

Instruct vs base — almost always pick Instruct

A base (or pretrained) model only continues text; ask it a question and it may just write more questions. An Instruct (or Chat / -it) model was fine-tuned to follow instructions and hold a conversation. For anything interactive, you want the Instruct variant. Base models are for fine-tuning and research. If the model rambles and ignores you, the most likely cause is that you grabbed the base weights by accident.

Official repo vs community re-quantizer

Many model makers don't publish GGUFs themselves — they ship the full-precision weights, and the community converts them. So you'll often download from a trusted re-quantizer rather than the original lab. That's normal and expected. The well-known, reputable GGUF uploaders the community relies on include bartowski, TheBloke (older but historically the standard), and the unsloth and ggml-org organizations. A repo from one of these, with a clear model card and a full ladder of quants, is a safe bet.

Green flags (trust) | Red flags (be careful)
Known uploader (bartowski, unsloth, ggml-org, official lab) | Anonymous account, no other models, no README
Full range of quants (Q2 → Q8) listed | Only one odd quant, or sizes that look wrong
Model card explains the source + chat template | No description, no link to the original model
Recent upload date matching the model's release | Suspiciously old date for a brand-new model

Going deeper

Once the basic pick is second nature, a few finer points let you squeeze out the last bit of quality or fit.

IQ quants and imatrix. Files starting with IQ (like IQ3_XS, IQ2_M) use a newer quantization scheme, usually built with an importance matrix (imatrix) that tells the quantizer which weights matter most. It preserves quality better at very low bit counts than the older Q3/Q2 quants. The trade-off: they can be a little slower on some hardware, especially pure CPU. Rule of thumb: at 4 bits and above, plain Q4_K_M/Q5_K_M is great; below 4 bits, an IQ quant usually beats the equivalent Q quant. Many _K quants today are also built with an imatrix even when the name doesn't say IQ; the model card will mention it.

K-quants vs legacy quants. You may still see old names like Q4_0 or Q4_1 without the _K. These are the legacy scheme and generally give lower quality than the _K versions at a similar size. Unless a tool specifically requires them, prefer the _K variant.

Splitting layers across GPU and CPU. You are not limited to files that fit entirely in VRAM. With GPU offloading, llama.cpp can put as many layers as fit on the GPU and run the rest on CPU. That lets a 24 GB card run, say, a 70B model slowly by keeping part of it in system RAM. So a file a bit larger than your VRAM is sometimes still usable — just slower. This is also why Macs with lots of unified memory punch above their weight; see running LLMs on a Mac.
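To get a first guess at how many layers to offload, you can divide the file evenly across the model's layers. This is a rough sketch under loud assumptions: `layers_on_gpu` is a hypothetical helper, it treats every layer as the same size, and it ignores the embedding and output tensors. The result is a starting value for llama.cpp's `--n-gpu-layers` option, to be adjusted by trial.

```python
def layers_on_gpu(file_gb, n_layers, vram_gb, headroom_gb=1.5):
    """Estimate how many equally sized layers fit in VRAM after headroom."""
    per_layer_gb = file_gb / n_layers
    usable_gb = max(vram_gb - headroom_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Assumed example: a ~40 GB 70B file with 80 layers on a 24 GB card.
print(layers_on_gpu(file_gb=40.0, n_layers=80, vram_gb=24.0))  # 45
```

Roughly half the layers land on the GPU in this example, which is why the model runs, but slowly: the other half is computed from system RAM.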

Context costs memory too. The numbers in the table above assume a modest context. If you load a very long context window, the KV cache grows and eats VRAM on top of the weights, which can push a file that "fit" into out-of-memory territory. If you plan to use long contexts, drop one quant tier or one model size to leave room.
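To see how quickly context eats memory, here is a back-of-the-envelope KV cache estimate. It is a sketch, not a measurement: it assumes a 16-bit cache and uses shape values (32 layers, 8 KV heads, head size 128) taken to be those of an 8B Llama-style model. Check the model's own config before relying on the numbers.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size in GiB for a given context length."""
    # The leading 2 accounts for storing both keys and values.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 2**30


# Assumed shape for an 8B Llama-style model.
print(kv_cache_gib(32, 8, 128, context_tokens=8_192))    # 1.0
print(kv_cache_gib(32, 8, 128, context_tokens=131_072))  # 16.0
```

Under these assumptions an 8K context costs about 1 GiB, while a 128K context costs about 16 GiB, several times the size of the weights themselves.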

Where to go next. GGUF is one of several quant formats; if you run on a high-end GPU you might compare it with GPTQ and AWQ in GGUF vs GPTQ vs AWQ. And once you have your file, the practical next step is loading it — start with how to run Ollama or reading a model card to confirm the right prompt format. The durable habit: decode the name, match it to your memory with headroom, confirm it's the Instruct build from a trusted uploader, and grab every part if it's split.

FAQ

Which GGUF file should I download if I don't know what to pick?

Pick the Q4_K_M file for a model size that fits your memory with 1–2 GB to spare. It is the community default — the best balance of size, speed, and quality. For an 8B model that's about a 5 GB file, which suits a typical 8 GB GPU.

How do I know which quant fits my VRAM?

Compare the file size on the repo to your GPU VRAM (or system RAM if you run on CPU/Mac), and leave 1–2 GB of headroom for context and working buffers. As a rough rule, a Q4_K_M file is a bit over half a gigabyte per billion parameters (about 0.6 GB), so an 8B is ~5 GB and a 70B is ~40 GB.
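If the repo page is not in front of you, the file size can be estimated from the parameter count and the bits per weight. The figures in this sketch are assumed typical values, not exact for any particular model, and `estimate_file_gb` is a hypothetical helper. The size shown on the repo is always the number to trust.

```python
# Approximate bits per weight for common quants (assumed typical values).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}


def estimate_file_gb(params_billion, quant):
    """Rough file size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for size in (8, 70):
    print(f"{size}B at Q4_K_M: about {estimate_file_gb(size, 'Q4_K_M'):.1f} GB")
```

This lands near 5 GB for an 8B model and a little over 40 GB for a 70B, in line with the rule of thumb.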

What are the split GGUF files like 00001-of-00003?

They are one model cut into parts because Hugging Face limits single files to 50 GB. Download all the parts for your chosen quant into the same folder, then point your loader at the first part (...00001-of-...). llama.cpp and LM Studio stitch the rest automatically, while Ollama may need the parts merged first, depending on its version.

Is bartowski's GGUF safe to download, and which one do I pick?

Yes — bartowski is one of the most trusted community GGUF uploaders, alongside unsloth, ggml-org, and TheBloke. Within their repo, pick by your memory budget: Q4_K_M as a default, step up to Q5_K_M/Q6_K if you have room, or down to an IQ3 quant only if you must fit a bigger model.

Should I download a bigger model at low quant or a smaller model at high quant?

Usually the bigger model at lower quant wins. A 13B at Q4_K_M typically beats an 8B at Q8_0 of similar file size. Spend a fixed memory budget on parameters first, then on bits — but don't go below about 3-bit, where quality drops off sharply.

What's the difference between the Q4_0 and Q4_K_M files?

Q4_0 is the older legacy quant scheme; Q4_K_M uses the modern K-quant method, which gives better quality at the same size. Always prefer the _K variant unless a specific tool requires the legacy format.

Further reading