Which GGUF File Should You Download? Reading a Hugging Face Quant Repo

You will walk away able to open any GGUF repo and confidently grab the single right file for your machine instead of guessing.

Beginner · 11 min read · Updated 2026-06-13

In plain English

You found a model you want to run on your own computer. You open its GGUF repository on Hugging Face and instead of one download button you get a wall of files: model-Q4_K_M.gguf, model-Q5_K_S.gguf, model-Q8_0.gguf, model-IQ3_XS.gguf, and maybe a few named ...00001-of-00003.gguf. Every one is the same model. They differ only in how aggressively the weights were squeezed down — the quant level.

Think of it like buying the same movie in different qualities: 4K, 1080p, 720p, and a tiny phone version. The 4K file is gorgeous but huge and your old laptop chokes on it. The phone version plays anywhere but looks soft. You are not picking a different movie — you are picking the largest quality that still fits and plays smoothly on the screen you actually own. A GGUF repo is exactly that shelf, and your "screen" is your memory budget.

So the whole task reduces to one question: how much memory do I have, and what is the biggest, best-quality file that fits inside it with a little room to spare? Once you can read the file names, the answer is usually obvious in about ten seconds.

Why it matters

Picking the wrong file is the single most common way a first local-LLM attempt goes wrong, and it fails in two opposite directions.

Too big → it crashes or crawls. If the file plus its working memory don't fit in your VRAM (GPU memory) or RAM, the model either refuses to load with an out-of-memory error, or it spills onto disk/CPU and generates at a painful one or two words per second. People conclude "local models are useless" when they simply grabbed the 4K file for a phone.
Too small → you throw away quality for no reason. If you have 24 GB of VRAM and download a tiny 3-bit quant "to be safe," you get a noticeably dumber model — more mistakes, worse instruction-following — while gigabytes of your hardware sit idle. The waste is invisible, so people never realize they left quality on the table.
Wrong repo entirely. Grabbing a base model when you wanted a chat model, or a sketchy re-upload with a broken chat template, produces a model that rambles or ignores your instructions even though the quant was fine.

Getting this right matters because the file you download decides three things at once: whether the model loads at all, how fast it answers, and how smart it feels. There is no "reload to fix it" — you have to redownload several gigabytes. Learning to read the repo once saves you from that loop forever, and it's the gateway skill for every local tool: Ollama, LM Studio, and raw llama.cpp all consume these same files.

How to read a GGUF file name

A GGUF file name encodes everything you need in a short code at the end. Once you can decode it, the shelf stops being scary. Take Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf apart:

Piece	What it tells you
`Llama-3.1`	The model family and version.
`8B`	Parameter count — 8 billion. This roughly sets the file size.
`Instruct`	It was tuned to follow chat instructions (vs a raw `base` model). For chatting, you want this.
`Q4`	Quantization bits — about 4 bits per weight. Lower = smaller + faster + slightly dumber.
`_K`	Uses the modern K-quant scheme (smarter than the old legacy quants). Almost everything today is `_K`.
`_M`	Size tier within that level: `S` = small, `M` = medium, `L` = large. `M` is the usual sweet spot.

So Q4_K_M reads as: 4-bit, K-quant, medium tier. That combination is the community default — the best balance of size and quality for most people, most of the time. When in doubt, this is the file to grab.
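To make the decoding concrete, here is a minimal sketch that pulls those pieces out of a file name. The helper name `parse_gguf_name` and the regular expression are illustrative assumptions: GGUF naming is a community convention rather than a formal spec, so unusual names may not match.

```python
import re

# Assumed pattern for the common "<name>-<quant>.gguf" shape, with an
# optional "-00001-of-00003" split suffix. Not an official grammar.
QUANT_RE = re.compile(
    r"^(?P<base>.+?)[-.](?P<quant>(?:IQ|Q)\d+(?:_[A-Z0-9]+)*)"
    r"(?:-(?P<part>\d{5})-of-(?P<total>\d{5}))?\.gguf$",
    re.IGNORECASE,
)

def parse_gguf_name(filename: str) -> dict:
    """Split a GGUF file name into the pieces described in the table above."""
    m = QUANT_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized GGUF name: {filename}")
    base = m.group("base")
    quant = m.group("quant").upper()
    return {
        "base": base,
        "quant": quant,
        # Leading digits after Q or IQ are the nominal bits per weight.
        "bits": int(re.match(r"I?Q(\d+)", quant).group(1)),
        "k_quant": "_K" in quant,
        # Heuristic only: looks for the usual chat-tuned markers in the name.
        "instruct": bool(re.search(r"instruct|chat|-it\b", base, re.IGNORECASE)),
        "part": int(m.group("part")) if m.group("part") else None,
        "total": int(m.group("total")) if m.group("total") else None,
    }

print(parse_gguf_name("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
```

The `instruct` flag is only a name check; the model card remains the reliable place to confirm whether a repo holds the chat-tuned build.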

The decision itself is a short funnel. You start from the memory you have, subtract a little overhead, and that tells you the biggest file you can afford. Then you pick the highest-quality quant whose file size lands under that number.

// From memory budget to one file

1. Check memory: GPU VRAM, or RAM if CPU/Mac
2. Leave headroom: model needs ~1-2 GB extra to run
3. Pick quant tier: biggest that fits → Q4_K_M default
4. Confirm it's Instruct: chat model, right uploader
5. Download that file: one .gguf (or its split parts)

The key rule for fit: the file should be comfortably smaller than your memory, not equal to it. Running a model needs extra room beyond the weights: the KV cache, which holds the context (your conversation so far), plus the runtime's working buffers. A good habit is to leave roughly 1–2 GB of headroom on top of the file size. A 7 GB file will not run happily on a GPU with exactly 8 GB of VRAM.
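The fit rule is simple enough to write down as a one-line check. This is a sketch, not a guarantee: the function name `fits` is made up for illustration, and the 2 GB default is just the conservative end of the 1–2 GB habit described above.

```python
def fits(file_gb: float, memory_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus headroom stay within the memory budget."""
    return file_gb + headroom_gb <= memory_gb

print(fits(7.0, 8.0))  # 7 GB file on an 8 GB card -> False
print(fits(4.9, 8.0))  # ~5 GB file on an 8 GB card -> True
```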

Map your memory to a quant

Here is the part everyone actually wants: a lookup table. "Memory" means your GPU VRAM if you run on a graphics card, or your system RAM if you run on CPU or an Apple Silicon Mac (where RAM is shared with the GPU). Find your budget, find a model size that fits, and read off the quant. Sizes below are approximate for a typical _K_M quant and shift a little per model.

Your memory	Comfortable model + quant	Approx file size
6–8 GB	7–8B at Q4_K_M	~5 GB
10–12 GB	8B at Q6_K, or 13–14B at Q4_K_M	~7–9 GB
16 GB	13–14B at Q5_K_M	~10 GB
24 GB	32B at Q4_K_M, or 8B at Q8_0	~20 GB (32B) or ~8.5 GB (8B)
48 GB	70B at Q4_K_S/M (tight)	~40–43 GB
64 GB+ RAM (CPU/Mac)	70B at Q4_K_M comfortably	~42–43 GB

How to read this in practice. Suppose you have an 8 GB graphics card. You scan the repo, you know you want an Instruct model, and you reach for the largest file that stays under ~6–6.5 GB (leaving headroom). For an 8B model that is almost always Q4_K_M. You download exactly one file and you are done.
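If a repo doesn't show sizes, or you want to sanity-check the table, you can estimate a file's size from the parameter count and the quant's bits per weight. The sketch below assumes rough bits-per-weight figures; they are approximations chosen for illustration, and the size listed on the repo is always the number to trust.

```python
# Rough bits-per-weight figures. These are approximations assumed for
# illustration; real files vary a little by model and uploader.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def estimate_file_gb(params_billion: float, quant: str) -> float:
    """Estimate file size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * APPROX_BITS_PER_WEIGHT[quant] / 8

for size_b in (8, 14, 32, 70):
    gb = estimate_file_gb(size_b, "Q4_K_M")
    print(f"{size_b}B at Q4_K_M: about {gb:.0f} GB")
```

Because a K-quant stores slightly more than its nominal bit count per weight, a "4-bit" file lands a little above half a gigabyte per billion parameters.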

Now the ranking within your budget — which quant beats which. From best quality to smallest, the practical ladder looks like this. Climb as high as your memory allows; the gains above Q5 are small, and the losses below Q3 are real.

// Quant ladder — top = best quality, bottom = smallest

Quant	What you get
Q8_0	near-perfect, ~2x the size of Q4; only if memory is plentiful
Q6_K	excellent, hard to tell from full precision
Q5_K_M	great quality, a common step up from Q4
Q4_K_M	the default sweet spot; pick this if unsure
Q3_K_M / IQ3	noticeably weaker; only to squeeze a bigger model in
Q2_K / IQ2	last resort; quality drops sharply
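Climbing the ladder can be sketched as a loop: walk from the top and take the first quant whose file fits your budget. The function `pick_quant`, the 2 GB default headroom, and the example sizes are all hypothetical; substitute the sizes shown on the repo you are actually looking at.

```python
from typing import Optional

# Best quality first, smallest last, following the ladder above.
LADDER = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]

def pick_quant(sizes_gb: dict, memory_gb: float,
               headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-quality quant whose file fits, or None if none do."""
    budget = memory_gb - headroom_gb
    for quant in LADDER:
        size = sizes_gb.get(quant)
        if size is not None and size <= budget:
            return quant
    return None

# Made-up sizes for a ~14B model, for illustration only.
example_sizes = {
    "Q8_0": 15.7, "Q6_K": 12.1, "Q5_K_M": 10.5,
    "Q4_K_M": 9.0, "Q3_K_M": 7.3, "Q2_K": 5.8,
}
print(pick_quant(example_sizes, memory_gb=12.0))  # -> Q4_K_M
```

If the function returns `None`, or only offers a rung below Q3, that is the signal to drop to a smaller model rather than a harsher quant.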

Split GGUF files (the ...00001-of-00003 parts)

For big models you will see names like model-Q4_K_M-00001-of-00003.gguf, ...-00002-of-00003.gguf, ...-00003-of-00003.gguf. These are not different choices and they are not alternatives. They are one file that was cut into pieces because Hugging Face caps single files at 50 GB. A 70B model can exceed that, so the uploader splits it.

The rule is simple: download every part of the set, and put them all in the same folder. Don't mix parts from different quants. If you choose Q4_K_M, grab all the Q4_K_M-0000X-of-0000N files and nothing else.

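llama.cpp and LM Studio (recent versions): just point the loader at the first part (...00001-of-0000N.gguf). The loader detects the rest in the same folder and stitches them automatically, so you do not need to merge anything by hand. Ollama is the one to double-check: some versions have refused sharded GGUFs on import, in which case you merge the parts into a single file first.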
If a tool insists on a single file: llama.cpp ships a small utility, llama-gguf-split, whose --merge option joins the parts back into one file. For everyday use you rarely need it, since modern loaders read the parts directly.
Pulling from Ollama's own library? You never see split parts at all; ollama pull handles the whole download and assembly for you. Splits only show up when you download raw files from a Hugging Face repo.
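A quick way to catch an incomplete download is to check that every numbered part of your chosen quant is present. The helper below is a sketch under the assumption that parts follow the `-0000X-of-0000N.gguf` naming shown above; `missing_parts` is an illustrative name, not part of any tool.

```python
import re

PART_RE = re.compile(r"^(?P<stem>.+)-(?P<idx>\d{5})-of-(?P<total>\d{5})\.gguf$")

def missing_parts(filenames: list, quant: str) -> list:
    """Return the part numbers of `quant` that are absent from `filenames`.

    An empty list means no gaps were found. It is also returned when no
    split parts for that quant exist at all, so check that case separately.
    """
    found, total = set(), 0
    for name in filenames:
        m = PART_RE.match(name)
        # Only count parts belonging to the chosen quant; never mix quants.
        if m and m.group("stem").endswith(quant):
            found.add(int(m.group("idx")))
            total = max(total, int(m.group("total")))
    return [i for i in range(1, total + 1) if i not in found]

downloaded = [
    "model-Q4_K_M-00001-of-00003.gguf",
    "model-Q4_K_M-00003-of-00003.gguf",
]
print(missing_parts(downloaded, "Q4_K_M"))  # -> [2]
```

In practice you would pass it the result of `os.listdir()` on your download folder before pointing a loader at the first part.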

Pick the right repo and uploader

Two repos can offer the "same" model and give very different results. Before you even look at quant names, make sure you are in the right repository.

Instruct vs base — almost always pick Instruct

A base (or pretrained) model only continues text; ask it a question and it may just write more questions. An Instruct (or Chat / -it) model was fine-tuned to follow instructions and hold a conversation. For anything interactive, you want the Instruct variant. Base models are for fine-tuning and research. If the model rambles and ignores you, the most likely cause is that you grabbed the base weights by accident.

Official repo vs community re-quantizer

Many model makers don't publish GGUFs themselves — they ship the full-precision weights, and the community converts them. So you'll often download from a trusted re-quantizer rather than the original lab. That's normal and expected. The well-known, reputable GGUF uploaders the community relies on include bartowski, TheBloke (older but historically the standard), and the unsloth and ggml-org organizations. A repo from one of these, with a clear model card and a full ladder of quants, is a safe bet.

Green flags (trust)	Red flags (be careful)
Known uploader (bartowski, unsloth, ggml-org, official lab)	Anonymous account, no other models, no README
Full range of quants (Q2 → Q8) listed	Only one odd quant, or sizes that look wrong
Model card explains the source + chat template	No description, no link to the original model
Recent upload date matching the model's release	Suspiciously old date for a brand-new model

Going deeper

Once the basic pick is second nature, a few finer points let you squeeze out the last bit of quality or fit.

IQ quants and imatrix. Files starting with IQ (like IQ3_XS, IQ2_M) use the newer i-quant scheme, which is normally built with an importance matrix (imatrix) and preserves quality better at very low bit counts than the older Q3/Q2. The trade-off: they can be a little slower on some hardware, especially pure CPU. Rule of thumb: at 4 bits and above, plain Q4_K_M/Q5_K_M is great; below 4 bits, an IQ quant usually beats the equivalent Q quant. The imatrix is a separate ingredient, so many _K quants today are also built with one even when the name doesn't say IQ; the model card will mention it.

K-quants vs legacy quants. You may still see old names like Q4_0 or Q4_1 without the _K. These are the legacy scheme and generally give lower quality than the _K versions at a similar size. Unless a tool specifically requires them, prefer the _K variant.

Splitting layers across GPU and CPU. You are not limited to files that fit entirely in VRAM. With GPU offloading, llama.cpp can put as many layers as fit on the GPU and run the rest on CPU. That lets a 24 GB card run, say, a 70B model slowly by keeping part of it in system RAM. So a file a bit larger than your VRAM is sometimes still usable — just slower. This is also why Macs with lots of unified memory punch above their weight; see running LLMs on a Mac.
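For a rough idea of how many layers to offload, divide the file size by the layer count and see how many of those slices fit in VRAM after headroom. This is a back-of-the-envelope sketch: it assumes all layers are the same size and ignores embeddings and the output layer, and `layers_on_gpu` is a hypothetical helper. The result is a starting value for llama.cpp's `-ngl` / `--n-gpu-layers` option, to be adjusted by trial.

```python
def layers_on_gpu(file_gb: float, n_layers: int, vram_gb: float,
                  headroom_gb: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = file_gb / n_layers
    budget_gb = max(vram_gb - headroom_gb, 0.0)
    return min(n_layers, int(budget_gb // per_layer_gb))

# A ~42 GB 70B file with 80 layers on a 24 GB card (illustrative numbers).
print(layers_on_gpu(42.0, 80, 24.0))  # -> 41
```

Roughly half the layers land on the GPU in that example, which is why the model runs but noticeably slower than one that fits entirely in VRAM.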

Context costs memory too. The numbers in the table above assume a modest context. If you load a very long context window, the KV cache grows and eats VRAM on top of the weights, which can push a file that "fit" into out-of-memory territory. If you plan to use long contexts, drop one quant tier or one model size to leave room.
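To see how quickly context eats memory, you can estimate the KV cache from a model's architecture. The sketch below uses the standard accounting of one key and one value vector per layer, per KV head, per token. The architecture numbers in the example (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions for an 8B Llama-style model; check the `config.json` of the model you actually use.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # Factor of 2: one tensor for keys and one for values.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30

# Assumed 8B Llama-style shape: 32 layers, 8 KV heads, head_dim 128.
print(kv_cache_gib(32, 8, 128, 8_192))    # -> 1.0
print(kv_cache_gib(32, 8, 128, 131_072))  # -> 16.0
```

Under those assumptions an 8K context costs about 1 GiB on top of the weights, while a 128K context costs about 16 GiB, which is more than the model file itself.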

Where to go next. GGUF is one of several quant formats; if you run on a high-end GPU you might compare it with GPTQ and AWQ in GGUF vs GPTQ vs AWQ. And once you have your file, the practical next step is loading it — start with how to run Ollama or reading a model card to confirm the right prompt format. The durable habit: decode the name, match it to your memory with headroom, confirm it's the Instruct build from a trusted uploader, and grab every part if it's split.

FAQ

Which GGUF file should I download if I don't know what to pick?

Pick the Q4_K_M file for a model size that fits your memory with 1–2 GB to spare. It is the community default — the best balance of size, speed, and quality. For an 8B model that's about a 5 GB file, which suits a typical 8 GB GPU.

How do I know which quant fits my VRAM?

Compare the file size on the repo to your GPU VRAM (or system RAM if you run on CPU/Mac), and leave 1–2 GB of headroom for context and working buffers. As a rough rule, a Q4_K_M file is about 0.6 GB per billion parameters, so an 8B is ~5 GB and a 70B is ~42 GB.

What are the split GGUF files like 00001-of-00003?

They are one model cut into parts because Hugging Face limits single files to 50 GB. Download all the parts for your chosen quant into the same folder, then point your loader at the first part (...00001-of-...); llama.cpp and LM Studio stitch the rest automatically. If your Ollama version rejects sharded files, merge the parts first with llama-gguf-split.

Is bartowski's GGUF safe to download, and which one do I pick?

Yes — bartowski is one of the most trusted community GGUF uploaders, alongside unsloth, ggml-org, and TheBloke. Within their repo, pick by your memory budget: Q4_K_M as a default, step up to Q5_K_M/Q6_K if you have room, or down to an IQ3 quant only if you must fit a bigger model.

Should I download a bigger model at low quant or a smaller model at high quant?

Usually the bigger model at lower quant wins. A 13B at Q4_K_M typically beats an 8B at Q8_0 of similar file size. Spend a fixed memory budget on parameters first, then on bits — but don't go below about 3-bit, where quality drops off sharply.

What's the difference between the Q4_0 and Q4_K_M files?

Q4_0 is the older legacy quant scheme; Q4_K_M uses the modern K-quant method, which gives better quality at the same size. Always prefer the _K variant unless a specific tool requires the legacy format.
