In plain English
Strip away the chat bubble and a large language model is just a giant pile of numbers — its weights — and one operation repeated billions of times: multiply a list of numbers by a grid of numbers, add the results, repeat. That operation is called matrix multiplication, and a single reply from a model can involve trillions of these multiply-and-add steps.
Here is the everyday analogy. Imagine you need to tally 10,000 separate grocery receipts. A CPU is one brilliant accountant: incredibly fast, can do tricky tax logic in their head, but works through the receipts one at a time. A GPU is a stadium full of interns: each one can only add and multiply, none of them is clever, but you hand all 10,000 receipts out at once and they finish in the time it takes the accountant to do a handful.
LLM math is exactly the second kind of job. There is almost no clever branching or decision-making — just an ocean of simple, independent multiplications that can all happen at the same time. The GPU was built for that ocean. The CPU was not. That single mismatch is why nearly every AI model you call runs on GPUs (or their cousins, TPUs), and why the hardware itself ends up setting the price of AI.
Why it matters
You do not need to buy a GPU to use an LLM — but every bill, every rate limit, and every "the model is slow today" traces back to one. Understanding the hardware turns a lot of mysterious behavior into something you can predict.
- Cost. Datacenter GPUs cost tens of thousands of dollars each and there is a global shortage. When you pay per token, you are really renting a slice of one of these chips for a few milliseconds. That is why AI compute is expensive — the math is cheap, but the silicon that does it fast is scarce.
- Speed. How fast tokens stream back to you is mostly a hardware question: how quickly the GPU can read the model's weights out of its memory. Bigger model, slower stream.
- Capacity and rate limits. A provider only has so many GPUs. Rate limits exist because your request is competing with everyone else's for the same finite pool of chips.
- What you can run yourself. If you want a local open model, the model's size in memory decides whether it fits on your laptop, needs a gaming GPU, or needs a server you cannot afford.
In short: the GPU is the meter. Once you can picture it, decisions like "which model should I use" or "why did my bill spike" stop being guesswork. It also explains the headlines: labs spend billions on chips for the same reason scaling laws hold, namely that more compute, applied to more data, makes better models.
How it works
When you send a prompt, your text is split into tokens, each token becomes a list of numbers (a vector), and that vector is pushed through layer after layer of the transformer. Every layer is, at bottom, the same move: take the vector, multiply it by the layer's weight matrices (first in the attention block, then in the feed-forward block), and pass the result to the next layer. After the last layer the model outputs one token, appends it to the sequence, and does the whole thing again for the next token.
The thing that makes this a GPU job is independence. To multiply a vector by a matrix, you compute many dot products — and none of them depends on the others. Cell (1,1) of the answer does not need to wait for cell (2,1). So you can hand all of them out at once.
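To make the independence concrete, here is a minimal pure-Python sketch. The helper names `dot` and `matvec` and the toy numbers are illustrative only, not taken from any library. Each output cell is one dot product that reads only its own row of the matrix, so the cells can be computed in any order, or all at once, and the answer is the same.

```python
import random

def dot(row, vec):
    """One output cell: multiply pairwise, then add up."""
    return sum(w * x for w, x in zip(row, vec))

def matvec(matrix, vec, order=None):
    """Multiply a matrix (a list of rows) by a vector.

    `order` is the sequence in which output cells are computed. It exists
    only to show that the order does not matter.
    """
    order = range(len(matrix)) if order is None else order
    out = [0.0] * len(matrix)
    for i in order:
        out[i] = dot(matrix[i], vec)  # reads only row i, never another cell
    return out

weights = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
x = [1.0, 0.5, -1.0]

in_order = matvec(weights, x)

shuffled = list(range(len(weights)))
random.shuffle(shuffled)
any_order = matvec(weights, x, order=shuffled)

print(in_order)               # [-1.0, 0.5]
print(in_order == any_order)  # True
```

A real model does the same thing with matrices that have thousands of rows and columns, which is why handing each cell to its own simple core pays off.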
CPU
- 8-64 large, complex cores
- Built for branching logic and one-thing-at-a-time speed
- Works through the multiplications in small batches
- Great at a web server; slow at a wall of independent math

GPU
- Thousands of simple cores + dedicated Tensor Cores
- Built to run the same operation on huge batches at once
- Computes thousands of multiply-adds every cycle
- Wins by raw throughput, not cleverness
Modern datacenter GPUs add Tensor Cores — circuits whose only job is to multiply small tiles of numbers and accumulate the result in a single step. They also use lower-precision number formats (FP16, FP8, and as of mid-2026 even FP4 on NVIDIA's Blackwell chips). Squeezing each number into fewer bits means more of them fit in memory and each multiply is cheaper — so you trade a sliver of accuracy for a lot of speed. This is the same idea behind quantization.
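The precision trade can be seen in a few lines. This is a rough illustration, assuming nothing beyond Python's standard `struct` module, which supports 32-bit floats (`f`) and IEEE half precision (`e`). FP8 and FP4 are not available there, so the sketch only covers the step from 32 bits to 16 bits; the helper name `round_trip` and the sample value are made up for the example.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Store `value` in a binary float format, then read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

weight = 0.1234567  # an arbitrary example weight

fp32 = round_trip(weight, "<f")  # 4 bytes per number
fp16 = round_trip(weight, "<e")  # 2 bytes per number (IEEE half precision)

print(struct.calcsize("<f"), struct.calcsize("<e"))  # 4 2
print(f"fp32 error: {abs(fp32 - weight):.2e}")
print(f"fp16 error: {abs(fp16 - weight):.2e}")
print(abs(fp32 - weight) < abs(fp16 - weight))       # True
```

Half the bytes per number means twice as many numbers per gigabyte, at the price of a larger rounding error on each one.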
But raw multiplying power is only half the story. The other half is memory, and it is usually the real bottleneck.
VRAM: why memory, not math, is the wall
A GPU has its own super-fast memory called VRAM (video RAM; the name is a leftover from graphics). It is separate from your computer's normal RAM, and it is where the model has to live while it runs. The catch: VRAM is small and pricey. A top datacenter GPU as of mid-2026 holds roughly 180 GB; a high-end gaming card holds 24-32 GB.
To run a model, its weights must fit in VRAM. A rough rule of thumb: a model needs about 2 bytes of VRAM per parameter at half precision (FP16). So a 70-billion-parameter model needs roughly 140 GB just for the weights — before you have stored a single token of your prompt. That is why a 70B model does not fit on a laptop, and why providers run it across multiple linked GPUs.
| Model size (parameters) | FP16 weights (~2 bytes/param) | 4-bit weights (~0.5 bytes/param) | Roughly fits on |
|---|---|---|---|
| 8B | ~16 GB | ~4-5 GB | A gaming GPU, even a good laptop at 4-bit |
| 70B | ~140 GB | ~35-45 GB | Multiple datacenter GPUs (FP16); one 48 GB card at 4-bit |
| ~700B-class | ~1.4 TB | ~350 GB+ | A whole rack of linked GPUs |
There is a second memory cost that grows while you talk: the KV cache. As the model generates, it stores the attention keys and values for every token so it does not have to recompute the whole conversation each step. The longer your context window gets, the more VRAM the KV cache eats — at long context it can rival the model weights themselves. That is one concrete reason long-context requests cost more and why providers cap how much you can stuff in.
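A common back-of-envelope formula for the KV cache is sketched below. The model shape (80 layers, 8 KV heads, head dimension 128) is a hypothetical 70B-class configuration chosen for illustration, not the spec of any particular model, and the function name `kv_cache_gb` is made up for this example.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for one sequence.

    Per token, every layer stores one key vector and one value vector
    (hence the factor of 2) for each KV head.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9

# Hypothetical 70B-class shape, for illustration only.
config = dict(layers=80, kv_heads=8, head_dim=128)

for tokens in (8_000, 128_000, 1_000_000):
    size = kv_cache_gb(tokens, **config)
    print(f"{tokens:>9,} tokens -> ~{size:6.1f} GB of KV cache")
```

Under these assumptions the cache is about 2.6 GB at 8,000 tokens, about 41.9 GB at 128,000 tokens, and about 327.7 GB at one million tokens, which is more than the 140 GB of FP16 weights for a 70B model.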
Estimate it yourself
You can sanity-check whether a model fits on a given GPU with grade-school arithmetic. Here is the back-of-envelope calculation providers use, turned into a tiny script.
    def vram_gb(params_billions: float, bits_per_param: int) -> float:
        """Rough VRAM needed JUST for model weights."""
        bytes_per_param = bits_per_param / 8
        total_bytes = params_billions * 1e9 * bytes_per_param
        return total_bytes / 1e9  # to gigabytes

    for bits in (16, 8, 4):
        need = vram_gb(70, bits)
        print(f"70B @ {bits:>2}-bit -> ~{need:6.1f} GB weights")

    # 70B @ 16-bit -> ~ 140.0 GB weights
    # 70B @  8-bit -> ~  70.0 GB weights
    # 70B @  4-bit -> ~  35.0 GB weights

The pattern is the whole game: halve the precision, roughly halve the memory. Going from 16-bit to 4-bit shrinks a 70B model from "needs a small server" to "fits on one beefy GPU." That is why quantization is the single most important trick for running big models on small hardware, and why a model you can use through an API may be impossible to run on your own machine at full precision.
The mid-2026 hardware landscape
A quick, current map of the chips and tools behind your model calls (figures verified as of mid-2026 — this space moves fast):
The chips
- NVIDIA GPUs dominate. The H100 (80 GB HBM3) was the workhorse of the previous (Hopper) generation; the newer Blackwell B200 carries ~180 GB of usable HBM3e, around 8 TB/s of memory bandwidth, native FP4 support, and NVLink 5 (~1.8 TB/s) to stitch many GPUs into one big memory pool. The reason labs link 8, 72, or hundreds of GPUs together is simple: no single chip's VRAM is big enough for a frontier model.
- Google TPUs are the main alternative, custom-built for this math. The latest, Ironwood (TPU v7), offers 192 GB per chip — 6x its Trillium predecessor — and powers Google's own Gemini models at scale.
- Apple Silicon and other unified-memory machines let enthusiasts run mid-size models locally, because the CPU and GPU share one large memory pool instead of a tiny separate VRAM bank.
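The "no single chip is big enough" point reduces to a division. The sketch below estimates how many GPUs it takes just to hold a model's weights. The function name `gpus_needed` is hypothetical, and the `reserve_fraction` of 0.2 (VRAM held back for the KV cache and activations) is a placeholder assumption, not a measured figure.

```python
import math

def gpus_needed(params_billions: float, bits_per_param: int,
                vram_gb_per_gpu: float, reserve_fraction: float = 0.2) -> int:
    """Minimum GPU count so the weights fit, leaving some VRAM spare.

    `reserve_fraction` is an assumed share of each GPU held back for the
    KV cache and activations; 0.2 is a placeholder, not a measured value.
    """
    weights_gb = params_billions * bits_per_param / 8
    usable_gb = vram_gb_per_gpu * (1 - reserve_fraction)
    return math.ceil(weights_gb / usable_gb)

print(gpus_needed(70, 16, 80))    # 70B at FP16 on 80 GB cards         -> 3
print(gpus_needed(700, 16, 180))  # 700B-class at FP16 on ~180 GB cards -> 10
print(gpus_needed(700, 4, 180))   # the same model at 4-bit            -> 3
```

Real deployments often round up further, since GPUs tend to be grouped in fixed-size nodes, but the arithmetic shows why linking chips together is unavoidable at this scale.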
Why the chips keep getting bigger
Frontier models keep growing, and the biggest ones now use Mixture-of-Experts designs — huge total parameter counts where only a fraction fire per token. That makes them cheaper to run per token but they still have to fit in memory in full, which keeps the pressure on VRAM. Meanwhile context windows have ballooned: flagship models like Claude Opus 4.x and Gemini 3 advertise 1-million-token (and larger) windows as of mid-2026, and every one of those tokens lands in the KV cache — more memory pressure again.
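The Mixture-of-Experts trade-off can be put in numbers with a small sketch. The model sizes below are hypothetical, the function name `moe_footprint` is made up, and the "about 2 FLOPs per active parameter per token" figure is a common approximation rather than an exact count.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_param: int = 8) -> dict:
    """Memory scales with TOTAL parameters; per-token math with ACTIVE ones."""
    return {
        "weights_gb": total_params_b * bits_per_param / 8,
        # Approximation: ~2 FLOPs (one multiply, one add) per active
        # parameter per generated token.
        "gflops_per_token": 2 * active_params_b,
        "active_share": active_params_b / total_params_b,
    }

# Hypothetical model sizes, for illustration only.
dense = moe_footprint(total_params_b=70, active_params_b=70)
moe = moe_footprint(total_params_b=600, active_params_b=40)

for name, m in (("dense 70B", dense), ("MoE 600B / 40B active", moe)):
    print(f"{name:<22} {m['weights_gb']:6.0f} GB weights, "
          f"{m['gflops_per_token']:4.0f} GFLOPs per token")
```

In this example the MoE model does less math per token than the dense one (80 versus 140 GFLOPs) while needing far more memory to hold (600 GB versus 70 GB at 8-bit).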
The software that stretches the silicon
- vLLM — an open-source serving engine whose PagedAttention manages the KV cache like an operating system manages memory pages, cutting wasted VRAM from 60-80% down to a few percent and serving far more users per GPU.
- FlashAttention — an attention algorithm that does the same math while reading/writing far less to memory, directly attacking the bandwidth bottleneck. See FlashAttention explained.
- llama.cpp and similar runtimes — let you run quantized open models on consumer hardware, including laptops and Apple Silicon.
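To see why paging the KV cache saves so much memory, here is a toy accounting sketch. It is not vLLM's code or API; the function names, the block size of 16 tokens, and the request lengths are all assumptions made for illustration. It compares reserving room for the longest possible sequence up front against handing out small fixed-size blocks only as tokens arrive.

```python
import math

def reserved_contiguous(lengths, max_len):
    """Each request reserves room for the longest possible sequence up front."""
    return len(lengths) * max_len

def reserved_paged(lengths, block_size=16):
    """Each request holds only as many fixed-size blocks as it has started to fill."""
    return sum(math.ceil(n / block_size) * block_size for n in lengths)

def waste(reserved, lengths):
    """Fraction of reserved token slots that hold no tokens."""
    return 1 - sum(lengths) / reserved

# Hypothetical current lengths (in tokens) of five in-flight requests.
lengths = [120, 900, 35, 2048, 400]

contiguous = reserved_contiguous(lengths, max_len=2048)
paged = reserved_paged(lengths)

print(f"contiguous: {waste(contiguous, lengths):.0%} wasted")  # 66% wasted
print(f"paged:      {waste(paged, lengths):.0%} wasted")       # 1% wasted
```

With paging, the only waste is the partly filled last block of each request, so the freed VRAM can hold more concurrent requests.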
Going deeper
Once you accept that memory is the wall, the advanced view falls into place: an LLM running on a GPU spends its life in one of two very different regimes, and almost every optimization targets one of them.
Prefill reads your entire prompt in one parallel pass. Because there is a mountain of independent math and the weights get reused across all your prompt tokens, it is compute-bound — this is where the GPU's raw FLOPS shine. Decode generates tokens one by one. Each new token re-reads the full set of weights but only does a thin slice of math, so the chip sits waiting on memory: it is memory-bandwidth-bound. This split is why time-to-first-token and tokens-per-second are two separate numbers, and why long prompts and long outputs cost differently.
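The decode bottleneck gives a simple upper bound on streaming speed for a single request: if every token requires reading all the weights once, tokens per second cannot exceed memory bandwidth divided by weight size. The sketch below uses the ~8 TB/s figure quoted above and deliberately ignores KV cache reads and all other overhead, so real numbers will be lower. The function name is hypothetical.

```python
def decode_tokens_per_second(params_billions: float, bits_per_param: int,
                             bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed.

    Assumes every generated token requires reading all weights once and
    that memory bandwidth is the only limit (no KV cache reads, no overhead).
    """
    weight_bytes = params_billions * 1e9 * bits_per_param / 8
    seconds_per_token = weight_bytes / (bandwidth_tb_s * 1e12)
    return 1 / seconds_per_token

for bits in (16, 8, 4):
    tps = decode_tokens_per_second(70, bits, bandwidth_tb_s=8)
    print(f"70B @ {bits:>2}-bit -> at most ~{tps:.0f} tokens/s")
```

Under these assumptions a 70B model tops out at roughly 57 tokens per second at 16-bit, 114 at 8-bit, and 229 at 4-bit for one stream, no matter how much raw math capacity the chip has.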
Two consequences worth internalizing:
- Batching is the economic engine of inference. Because decode wastes most of the chip waiting on memory, providers run many users' requests together in one batch. The weights get read from VRAM once and reused across everyone in the batch, so throughput climbs dramatically without proportional cost. Your per-token price is low precisely because you are sharing a GPU with strangers. This is also why a private, single-user deployment is less cost-efficient than an API.
- Lower precision is a memory strategy, not just a speed trick. FP8 and FP4 weights (and FP8 KV caches) exist mostly to move fewer bytes through that bandwidth bottleneck and to fit more in VRAM. The accuracy you trade away is the toll you pay to get under the memory wall.
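The effect of batching can be estimated with a simplified roofline-style model: one decode step takes as long as the slower of "read the weights once" and "do the math for every request in the batch." The hardware numbers below (8 TB/s bandwidth, 2000 TFLOPS peak) are placeholder assumptions, the function name is hypothetical, and KV cache traffic is ignored.

```python
def step_time_s(batch: int, params_billions: float, bits_per_param: int,
                bandwidth_tb_s: float, peak_tflops: float) -> float:
    """Time for one decode step that emits one token per request in the batch."""
    params = params_billions * 1e9
    # Weights are read from memory once per step, regardless of batch size.
    memory_s = params * bits_per_param / 8 / (bandwidth_tb_s * 1e12)
    # Approximation: ~2 FLOPs per parameter per token, for every request.
    compute_s = 2 * params * batch / (peak_tflops * 1e12)
    return max(memory_s, compute_s)  # the slower of the two sets the pace

# Assumed hardware figures, for illustration only.
for batch in (1, 8, 64, 512):
    t = step_time_s(batch, 70, 8, bandwidth_tb_s=8, peak_tflops=2000)
    print(f"batch {batch:>3}: {batch / t:>8.0f} tokens/s in total")
```

In this model total throughput grows almost linearly with batch size while the step stays memory-bound (batches of 1, 8, and 64 all take the same time per step), then flattens once the batch is large enough that compute becomes the limit.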
Put it all together and the economics of AI become legible. The math is embarrassingly parallel, so it runs on thousands of tiny cores. Those cores starve without fast memory, and fast memory is the scarce, expensive part. Every lever the industry pulls — bigger HBM, faster NVLink, quantization, FlashAttention, PagedAttention, giant batches — is some way of getting more useful tokens out of a fixed, costly pool of VRAM. That is the whole story of why LLMs need GPUs, and why those GPUs set the price of everything you build on top of them.
FAQ
Can you run an LLM on a CPU instead of a GPU?
Yes, technically — a CPU can do the exact same matrix math, and small quantized models run acceptably on a modern CPU. But for anything large or fast, a CPU is orders of magnitude slower because it can only run a handful of multiplications in parallel, while a GPU runs thousands. For real-time chat with a big model, a GPU (or TPU) is effectively required.
What is VRAM and why does it matter for AI?
VRAM is the fast memory built into a GPU, separate from your computer's regular RAM. The model's weights have to fit in VRAM to run, and VRAM is small and expensive. As a rule of thumb a model needs about 2 bytes of VRAM per parameter at FP16 — so a 70B model needs roughly 140 GB just for weights. VRAM capacity is usually what decides whether a given model fits on a given GPU.
Why is AI compute so expensive?
The arithmetic itself is cheap; the bottleneck is the hardware that does it fast. Datacenter GPUs cost tens of thousands of dollars each, there is a global shortage, and frontier models need many of them linked together. When you pay per token you are renting milliseconds on one of these chips. Providers keep prices as low as they are mainly by batching many users onto each GPU.
What's the difference between a GPU and a TPU for LLMs?
Both are parallel processors built for matrix math. GPUs (mostly NVIDIA's H100 and Blackwell B200 as of mid-2026) are general-purpose accelerators used across the industry. TPUs are Google's custom chips designed specifically for neural-network math; the latest, Ironwood (TPU v7), has 192 GB per chip and powers Google's own Gemini models. For most users the practical difference is invisible — both deliver the parallelism LLMs need.
Why does a bigger model respond more slowly?
Generating each token requires reading the model's entire set of weights out of VRAM. A bigger model has more weights to read, so each token takes longer — even if the GPU's raw math capacity is barely used. This is called being memory-bandwidth-bound, and it's why model size, not cleverness, is the main driver of streaming speed.
How much GPU memory do I need to run a 70B model locally?
At full FP16 precision, about 140 GB, which is more than any single consumer GPU holds. With 4-bit quantization the weights drop to roughly 35-45 GB, which can fit on a single high-end card (such as a 48 GB GPU), though you still need some headroom on top of that for the KV cache. Quantization is the standard trick that makes large models runnable on small hardware, at the cost of a little accuracy.