In plain English
When you download a local model — say, a 7B Llama or a 70B Qwen — you're downloading a file full of numbers called weights. To run the model, every single one of those numbers has to sit in memory so the GPU or CPU can read it billions of times per second while generating each token. If the weights don't fit, the model either refuses to start or crawls to a painful few tokens per minute.
Think of it like loading a spreadsheet into a calculator. The calculator (your GPU) is extremely fast, but its working surface (VRAM) is small. If the spreadsheet (the model weights) is bigger than that surface, you have two choices: shrink the spreadsheet, or spill the overflow onto the desk (system RAM) and accept that the calculator has to keep reaching over to grab pages — which is much slower.
The two memory pools you care about are:
- VRAM (video RAM, on-GPU memory) — fast, limited. Typical consumer GPUs carry 8 GB to 24 GB. RTX 5090 is an outlier at 32 GB. This is where you want the model to live.
- System RAM — slower path to the GPU, much larger. Most workstations have 32–128 GB. Used for CPU inference or to hold layers that overflow VRAM.
Why it matters
Pick the wrong hardware and you get one of three failure modes: the model doesn't load at all, it runs but produces tokens so slowly it's unusable (below ~3 tokens/second for interactive use), or it evicts other apps from memory and crashes your system. Understanding the math up front saves you from downloading 40 GB of files only to discover they don't fit.
There's also a practical ceiling effect: if you have 8 GB of VRAM, you can run a fully GPU-accelerated 7B model at Q4 quantization. If you try to force a 13B model into that same 8 GB, the model will partially spill to your CPU, and you may get 2–5 tokens/second instead of 60+. Same file, same GPU, radically different experience.
The good news is the math is simple enough to do in your head once you understand parameters, bytes per weight, and quantization. A 7B model with 4-bit quantization needs roughly 4–5 GB. A 70B model at Q4 needs roughly 35–40 GB. You can spot-check any model you find on Hugging Face before spending a minute on the download.
How the math works
Every LLM is described by its parameter count — the number of individual weights in the model. A "7B" model has seven billion floats. At full precision (FP16), each float takes 2 bytes, so a 7B model weighs 7 × 10⁹ × 2 bytes = 14 GB just for weights. That's more than most consumer GPUs hold.
Quantization compresses those weights by rounding them to fewer bits. The most common format you'll encounter is GGUF (the file format used by llama.cpp and Ollama), which offers several quantization levels:
| Quantization | Bits per weight | Bytes per weight | 7B model size | 13B model size | 70B model size |
|---|---|---|---|---|---|
| FP16 (full) | 16 | 2.0 | ~14 GB | ~26 GB | ~140 GB |
| Q8_0 | 8 | 1.0 | ~7 GB | ~13 GB | ~70 GB |
| Q6_K | 6 | 0.78 | ~5.5 GB | ~10 GB | ~55 GB |
| Q5_K_M | 5 | 0.68 | ~4.8 GB | ~9 GB | ~48 GB |
| Q4_K_M | 4.5 (mixed) | 0.57 | ~4.1 GB | ~7.5 GB | ~40 GB |
| Q3_K_M | 3 | 0.38 | ~2.8 GB | ~5 GB | ~26 GB |
| Q2_K | 2 | 0.26 | ~2.0 GB | ~3.5 GB | ~18 GB |
Model weights are not the only thing eating memory. Two other consumers matter:
- KV cache — during inference, the model stores intermediate attention keys and values for every token in the context window. This grows linearly with context length. An 8K context window adds roughly 1–3 GB on a 7B model; a 128K context on a 32B model can consume over 32 GB of cache alone.
- Runtime overhead — the inference framework (Ollama, llama.cpp, etc.) uses an additional few hundred MB to ~1 GB for buffers, scratch space, and the loaded executable.
A practical rule of thumb: budget 10–20% on top of the raw weight size for the KV cache at a typical 4K–8K context. So a Q4_K_M 7B model needs its ~4.1 GB for weights plus roughly 0.5–1 GB overhead = about 5–5.5 GB total to load comfortably.
Quick sizing guide by hardware
The table below maps common consumer GPU VRAM tiers to the largest model you can run fully on-GPU at Q4_K_M quantization with an 8K context window. Running a model fully on the GPU is critical for usable speed — as soon as layers spill to system RAM, you may drop from 60+ tokens/second to 2–5.
| VRAM | Typical GPU | Largest model (full GPU, Q4_K_M) | Notes |
|---|---|---|---|
| 4 GB | RTX 3050, RX 6500 XT | 3B–4B | Very constrained; Phi-3-mini or similar tiny models |
| 6 GB | RTX 3060 (6 GB), RX 7600 | 7B (tight) | 7B at Q4_K_M fits but leaves little room for long context |
| 8 GB | RTX 3070, RX 6700 XT | 7B (comfortable) | Sweet spot for the 7B class; Q5_K_M fits too |
| 12 GB | RTX 3060 (12 GB), RTX 4070 | 7B at Q8 or 13B at Q4_K_M | Excellent for 7B; handles most 13B variants |
| 16 GB | RX 7900 GRE, RTX 4080 Super | 13B at Q8 or 14B/16B at Q4_K_M | Comfortable for mid-size models |
| 24 GB | RTX 3090, RTX 4090, RX 7900 XTX | 32B at Q4_K_M or 13B at FP16 | Handles nearly all popular open models |
| 32 GB | RTX 5090 | 70B at Q2_K or 34B at Q4_K_M | 70B still requires aggressive quantization |
| 48 GB | RTX A6000, 2x RTX 3090 | 70B at Q4_K_M (full GPU) | The practical minimum for a good 70B experience |
For Apple Silicon Macs, unified memory means the full pool is available to the GPU. An M3 Pro with 36 GB can fully load a 32B model at Q4_K_M. An M3 Max or M4 Max with 64 GB handles a 70B model at Q4_K_M with room to spare. Apple's MLX framework delivers 10–25% faster inference than llama.cpp on the same chip by exploiting the unified memory architecture natively. An M4 Max benchmarks at roughly 70 tokens/second on 70B models — faster than many multi-GPU desktop rigs.
CPU offloading and system RAM
When a model is too large to fit entirely in VRAM, tools like llama.cpp and Ollama let you partially offload layers to the GPU. The rest of the model stays in system RAM and runs on the CPU. This is controlled with the --n-gpu-layers flag in llama.cpp (or the equivalent num_gpu setting in Ollama).
Setting --n-gpu-layers -1 (or 999) tells llama.cpp to load as many layers as VRAM allows and overflow the rest to CPU. The GPU still accelerates every layer it holds; the CPU handles the rest. This means partial offloading is almost always worth enabling, even if only half the layers fit — you'll get noticeably better speed than pure CPU inference.
The performance drop from partial offloading can be severe. A 70B model with half its layers on CPU might generate 3–8 tokens/second instead of 30+. But it will work, and it keeps larger models accessible on mid-range hardware.
For CPU-only inference (no GPU at all), you need enough system RAM to hold the entire quantized model plus overhead. Minimum recommendations:
- 7B Q4_K_M: 8 GB RAM minimum, 16 GB comfortable — expect 5–15 tokens/second on a modern 8-core CPU
- 13B Q4_K_M: 16 GB RAM minimum, 32 GB comfortable — expect 3–8 tokens/second
- 70B Q4_K_M: 48 GB RAM minimum — expect under 3 tokens/second, often 0.5–1.5
Going deeper
Once you've matched your hardware to a model size, a few advanced topics affect real-world performance:
Memory bandwidth is often the real bottleneck
VRAM capacity tells you whether a model fits. Memory bandwidth tells you how fast tokens generate. LLM inference is memory-bandwidth-bound, not compute-bound: the GPU spends most of its time streaming weights from VRAM, not doing matrix multiplications. The RTX 4090 has 1,008 GB/s of memory bandwidth; an Apple M4 Max reaches 546 GB/s. An RTX 3090 at 936 GB/s often matches the 4090 on tokens/second despite slower compute, because bandwidth is the ceiling.
Context length multiplies KV cache cost
If you need a 128K context window instead of the default 4K–8K, the KV cache can grow from ~1 GB to 30+ GB on a 7B model. Running long contexts requires either a GPU with substantial VRAM headroom or a framework that supports KV cache quantization (compressing the cache to 8-bit or 4-bit FP8/INT8). Ollama and llama.cpp both support --cache-type-k q8_0 to halve cache memory use with minimal quality impact.
Mixture-of-Experts models change the math
Models like Mixtral 8x7B and Qwen3-235B-A22B are Mixture-of-Experts (MoE) architectures. They have a large total parameter count but only activate a fraction (the "active parameters") per token. Mixtral 8x7B has 47B total parameters but only 13B are active per forward pass. You still need to load all 47B of weights into memory (about 27 GB at Q4_K_M), but the compute cost per token is closer to a 13B model. For MoE models, VRAM is still set by total parameters, not active ones.
The quick formula to remember
VRAM needed (GB) ≈ (parameters_billions × bytes_per_weight) + kv_cache_GB + 1 GB overhead
For Q4_K_M: bytes_per_weight ≈ 0.57
For Q8_0: bytes_per_weight ≈ 1.0
For FP16: bytes_per_weight = 2.0
KV cache at 8K context ≈ 0.5–2 GB for 7B–13B models
KV cache at 8K context ≈ 4–8 GB for 70B models
Examples:
7B Q4_K_M: 7 × 0.57 + 1 + 1 ≈ 6 GB
13B Q4_K_M: 13 × 0.57 + 2 + 1 ≈ 10.5 GB
70B Q4_K_M: 70 × 0.57 + 6 + 1 ≈ 46.9 GBThe actual file size you see on Hugging Face or Ollama's model page is the most reliable check — it already accounts for the specific quantization variant's mixed-precision details. If the file is smaller than your VRAM, it will load. Add ~10–20% for KV cache and overhead to get your true minimum.
FAQ
Can I run a 70B model on a single RTX 4090?
Not comfortably at full GPU speed. A Q4_K_M 70B model weighs around 40 GB, but the RTX 4090 only has 24 GB of VRAM. You can partially offload layers to CPU RAM and run the model, but expect 3–8 tokens/second instead of 30+. For a good 70B experience you need either dual 24 GB GPUs, a 48 GB professional GPU like the RTX A6000, or an Apple Silicon Mac with 64 GB+ of unified memory.
What is the minimum GPU to run a 7B model at a usable speed?
An 8 GB VRAM GPU — such as an RTX 3070 or RTX 4060 — comfortably loads a 7B model at Q4_K_M quantization (~4–5 GB) and generates 40–80 tokens/second. A 6 GB card like the RTX 3060 (6 GB variant) fits the weights but leaves almost no headroom for long contexts and may run slower. If you're buying new, the RTX 4060 Ti (16 GB) is the value sweet spot for the 7B–13B range.
Does system RAM matter if I have a dedicated GPU?
Yes, in two ways. First, if the model overflows your VRAM (partial offloading), the overflow layers run in system RAM — so you need enough RAM to hold the entire model. Second, system RAM holds the model file while it streams to the GPU at startup. 32 GB of system RAM is the practical minimum for 70B models even if you have a large GPU; 16 GB is fine for the 7B–13B range.
How does quantization affect quality — is Q4_K_M really good enough?
For most practical tasks, yes. Q4_K_M loses approximately 1–3% on standard benchmarks like MMLU compared to FP16, which is imperceptible in conversation, coding, and writing. You may notice degradation on tasks requiring precise numerical reasoning or very low-frequency knowledge. Q5_K_M and Q6_K close that gap further if you have the VRAM to spare. Q2_K has more noticeable quality loss and is only worth using under severe memory pressure.
Why does a model need more memory than just the file size?
The file on disk is only the compressed weights. At runtime, three more things fill memory: the KV cache (intermediate attention data that grows with your context window), activations (temporary working memory computed between layers during generation), and framework overhead (Ollama or llama.cpp itself needs 500 MB–1 GB). Budget 10–20% above the file size as a comfortable minimum.
Can I run LLMs on integrated graphics or a laptop GPU?
Integrated graphics (Intel UHD, AMD Radeon in Ryzen chips) share system RAM but offer very little VRAM — typically 2–4 GB reserved — and have much lower memory bandwidth than discrete GPUs. They can run 3B–7B models at Q4_K_M via CPU inference, but you won't get GPU acceleration. Laptop discrete GPUs (RTX 4060 Laptop, RTX 4070 Laptop) have the same VRAM as their desktop variants but lower power limits, so they're slower — expect roughly half the tokens/second of a desktop card with equivalent VRAM.