In plain English
A large language model is, underneath, just a giant pile of numbers called weights — billions of them. Each weight is normally stored as a fairly precise number that takes up 16 bits of memory. Quantization is the trick of storing those same weights using fewer bits each — often just 4 — so the whole model takes up roughly a quarter of the space. Smaller model, same job, a fraction of the memory.
Here's the everyday analogy. Imagine a recipe that calls for "237.4 grams of flour." That's precise, but in your kitchen you'd just round it to "a cup." The cake still comes out fine. Quantization does this to every weight in the model: it rounds each precise 16-bit number to a coarser, lower-bit approximation. You lose a tiny bit of exactness, but the model still bakes the same cake — and now it fits in a much smaller bowl.
The number of bits is the headline. A weight in 16-bit precision can express a huge range of values very finely. Squeeze it down to 8-bit and you halve the memory. Squeeze it to 4-bit and you quarter it — that's the sweet spot most people use to run big models at home. The fewer the bits, the smaller and faster the model, and the more rounding error you accept in exchange.
Why it matters
The single hardest part of running a model on your own machine is memory. Not speed, not setup — memory. A model has to fit inside your GPU's video memory (VRAM) to run fast, and big models are enormous. A 70-billion-parameter model in its native 16-bit form needs about 140 GB of VRAM. No consumer graphics card has that. Without quantization, running a frontier-class open model at home is simply impossible.
The rough rule of thumb: memory needed (in GB) ≈ number of parameters (in billions) × bytes per weight. At 16-bit that's 2 bytes per weight; at 4-bit it's about 0.5 bytes. Watch what quantization does to the same 70B model:
| Precision | Bytes/weight | Size of a 70B model | Runs on |
|---|---|---|---|
| 16-bit (fp16) | ~2 | ~140 GB | Multi-GPU server only |
| 8-bit (int8) | ~1 | ~70 GB | High-end workstation |
| 4-bit (int4) | ~0.5 | ~35 GB | Two consumer GPUs / a Mac |
| ~2-bit | ~0.3 | ~21 GB | A single high-VRAM GPU |
Suddenly the impossible is merely hard. A 4-bit 70B model fitting in ~35 GB is within reach of a beefy gaming rig or an Apple Silicon Mac with unified memory. Smaller models benefit even more dramatically: a 7B model that needs ~14 GB at 16-bit drops to ~4 GB at 4-bit, which runs comfortably on a mid-range laptop GPU. This is the whole reason local LLMs are a thing you can actually do.
Memory isn't the only payoff. Smaller weights mean less data shuffling between memory and the chip, so quantized models also run faster and use less power. That matters everywhere: a phone running an AI feature offline, a cloud inference server packing more models onto each GPU, an edge device with no internet. Quantization is what replaced "you need a data center" with "you need a decent laptop."
How it works
To understand quantization, you need one idea: a number stored with more bits can represent more distinct values. A 16-bit number can pick from tens of thousands of finely-spaced values. A 4-bit number can only pick from 16 values total. Quantizing means mapping the original fine-grained weights onto that tiny menu of allowed values — and storing only which menu slot each weight landed in.
The mechanics, step by step. First you look at a group of weights and find their range — the smallest and largest values. Then you slice that range into evenly-spaced bins (16 of them for 4-bit). Each original weight gets rounded to the nearest bin and stored as a small whole number (0–15) saying which bin it's in. You also save one extra number per group, the scale factor, which records how to stretch those bin indices back into real values when the model runs.
That "per group" detail is the secret sauce. If you used one single range for the entire model, a few unusually large weights (called outliers) would stretch the range so wide that all the normal weights bunch into just two or three bins — destroying accuracy. So good quantizers split the weights into small blocks (say, 32 or 64 weights) and compute a separate scale factor for each block. This block-wise approach keeps the rounding tight where it matters. The smaller the block, the better the accuracy and the more scale factors you store.
The accuracy trade-off
Every bit you drop adds rounding error, and the model's outputs degrade slightly. The crucial finding from years of practice: that degradation is not linear. Going from 16-bit to 8-bit is almost free — the quality loss is usually unmeasurable. Going to 4-bit costs a little, often a barely-noticeable dip. Below 4-bit, quality starts falling off a cliff, faster for small models than large ones.
- Half the memory
- Quality loss ~zero
- Safe default if it fits
- Quarter the memory
- Small, usually fine loss
- The popular sweet spot
- Tiny footprint
- Noticeable quality drop
- Only worth it for huge models
This is why 4-bit is the community default: it's the most memory you can save before quality starts to genuinely hurt. And bigger models are more forgiving — a 4-bit 70B model often beats an 8-bit 13B model that takes the same space, because the larger model has so much redundancy that it shrugs off the rounding. When in doubt, a bigger model at lower precision usually wins.
GGUF, GPTQ, AWQ: the flavors you'll meet
When you go to download a model from Hugging Face, you won't see "quantization" as a single button. Instead you'll meet a handful of named methods and formats, each a different recipe for doing the rounding and packaging the result. They're not interchangeable — which one you pick depends on the tool you'll run the model in. Here's the cheat sheet.
| Name | What it is | Best for |
|---|---|---|
| GGUF | A file format (from the llama.cpp project) that packs quantized weights plus metadata into one file | Running on CPU, Mac, or mixed CPU/GPU — Ollama and llama.cpp use it |
| bitsandbytes | On-the-fly 4-bit / 8-bit quantization done at load time, no pre-processing | Quick Python use and QLoRA fine-tuning |
| GPTQ | A calibrated method that quantizes weights using a small sample dataset to minimize error | GPU inference with good 4-bit accuracy |
| AWQ | Activation-aware quantization that protects the most important weights from rounding | Fast, accurate 4-bit GPU inference |
You almost never have to quantize a model yourself. Prolific community members — names like TheBloke historically, and many others — publish ready-made quantized versions of popular models in every format and bit-width. You just download the variant that matches your hardware and tool. A typical GGUF repo offers a ladder of files from Q2_K (smallest, roughest) up through Q4_K_M (the popular balanced pick) to Q8_0 (biggest, nearly lossless).
Hands-on: load a 4-bit model
Talk is cheap — here's quantization in action. The simplest path in Python is bitsandbytes, which quantizes a model to 4-bit as it loads, no separate conversion step. Notice how little code it takes: a config object and one extra argument.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_name = "Qwen/Qwen3-8B" # an 8B model: ~16 GB at 16-bit, ~5 GB at 4-bit
# Tell transformers to quantize the weights to 4-bit while loading.
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # a 4-bit type tuned for neural-net weights
bnb_4bit_compute_dtype="bfloat16", # math still runs in 16-bit for accuracy
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=quant_config,
device_map="auto", # place layers on the GPU automatically
)
prompt = "Explain quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output[0], skip_special_tokens=True))Two lines do the real work. load_in_4bit=True is the whole point — the weights land in your VRAM at a quarter of their normal size. bnb_4bit_compute_dtype="bfloat16" is the subtle bit: the weights are stored in 4-bit, but each one is briefly expanded back to 16-bit for the actual multiplication, then the result stays 16-bit. You save the memory of 4-bit while keeping most of the math accuracy of 16-bit.
If you'd rather not touch Python, the no-code route is even shorter. A tool like Ollama pulls a GGUF-quantized model and runs it with one command — the quantization is already baked into the file you download.
# Pulls a ready-made 4-bit GGUF build and runs it locally.
ollama run llama3.1:8b
# Or pick an explicit quantization level by tag:
ollama run llama3.1:8b-instruct-q4_K_MGoing deeper
The block-wise 4-bit picture above is the foundation. Once it clicks, here are the ideas that separate a casual user from someone who actually tunes deployments.
The outlier problem is the central research story. Transformer activations contain a small number of features with enormous magnitudes, and naive quantization gets wrecked by them. The landmark LLM.int8() work handled this by keeping those rare outlier dimensions in 16-bit while quantizing everything else to 8-bit. AWQ (activation-aware weight quantization) took a different angle: identify the ~1% of weights that matter most for the outputs and protect them from heavy rounding. Most modern accuracy gains come from cleverly deciding what not to quantize as hard, not from better rounding overall.
Weights vs. activations vs. the KV cache are three separate things you can quantize. Everything so far has been weight quantization — the stored model. But you can also quantize the activations (the intermediate numbers flowing through the network at runtime) and the KV cache (the memory of past tokens that grows with your context length). On long-context workloads the KV cache can dwarf the weights, so KV-cache quantization is a hot area for squeezing more conversation into the same VRAM.
Post-training quantization vs. quantization-aware training. Almost everything you'll touch is post-training quantization (PTQ): take a finished model and compress it, optionally with a small calibration dataset (that's what GPTQ and AWQ do). The heavier alternative is quantization-aware training (QAT), where the model is trained with the rounding simulated in the loop, so it learns weights that survive low precision gracefully. QAT gives the best low-bit accuracy but costs a full training run, so it's reserved for models that will ship at scale.
Quantization pairs with fine-tuning via QLoRA. You can't easily fine-tune a quantized model directly — the rounded weights aren't smooth enough to train. QLoRA solves this elegantly: freeze the base model in 4-bit, then train a tiny LoRA adapter in 16-bit on top. The frozen base stays small in memory while the trainable part stays accurate, which is how people fine-tune big models on a single consumer GPU. It's a direct descendant of the ideas on this page.
The honest open problems. Below 4-bit, accuracy is fragile and method-dependent — 2-bit and 3-bit quantization can be excellent on one model and unusable on another, and there's no clean rule yet for predicting which. Evaluating a quantized model is also genuinely hard: a quantized model can score the same on a benchmark while subtly degrading on rare edge cases or long outputs, so careful evaluation on your own tasks matters more than trusting a single accuracy number.
FAQ
What is quantization in an LLM in simple terms?
It's storing a model's weights with fewer bits each — typically dropping from 16-bit down to 4-bit — so the model takes up about a quarter of the memory. You accept a tiny bit of rounding error in exchange for a much smaller, faster model that can run on consumer hardware.
Does quantization make a model worse?
A little, but usually far less than people fear. Going from 16-bit to 8-bit is essentially free, and 4-bit costs only a small, often unnoticeable dip in quality — especially on larger models, which are more forgiving. Below about 4-bit the quality loss becomes real and can drop sharply.
What's the difference between GGUF, GPTQ, and AWQ?
GGUF is a file format (a container) used by llama.cpp and Ollama, ideal for CPU and Mac. GPTQ and AWQ are quantization techniques that calibrate on sample data to choose the rounding more carefully for better 4-bit accuracy on GPUs. You pick based on the tool you'll run the model in.
How much VRAM do I need for a 4-bit model?
A rough estimate is the parameter count in billions times about 0.5 GB, plus headroom for the context window. So a 7B model needs roughly 4–5 GB at 4-bit, and a 70B model needs around 35 GB. Always leave some spare VRAM so the model doesn't spill into slow system RAM.
Is quantization the same as fine-tuning or distillation?
No. Quantization only changes how an existing model's weights are stored — it doesn't retrain anything. Fine-tuning changes the weights to teach new behavior, and distillation trains a smaller model from scratch using a bigger one as a teacher. They're often combined, as in QLoRA, but they solve different problems.
Which quantization level should I download?
For most people, 4-bit (often labeled Q4_K_M in GGUF) is the sweet spot: big memory savings with minimal quality loss. Use 8-bit if you have the VRAM and want maximum fidelity, and only drop to 2–3 bit for very large models where it's the only way they'll fit.