Tokens per Second Explained: How to Measure Local LLM Speed

You will know how to read and measure tokens-per-second numbers, and judge whether a setup is fast enough before you commit to a model.

Beginner · 10 min read · Updated 2026-06-13

In plain English

When you run a local LLM, the model doesn't write whole sentences at once. It produces text one token at a time — a token being a small chunk of text, roughly three-quarters of a word on average. Tokens per second (often written tok/s or t/s) is simply how many of those chunks the model can produce each second. It's the headline number people quote when they say a setup feels "fast" or "painfully slow."

[Figure: Tokens per Second illustration (source: testingcatalog.com)]

Picture someone reading a book aloud to you. A fast, fluent reader runs ahead of your eyes; a slow one makes you wait, word by word, for the sentence to finish. Tokens per second is the model's reading-aloud speed. At 40 tok/s the words pour out faster than you can read them. At 3 tok/s you watch each word appear like a 1990s dial-up web page loading line by line.

There's a catch most people miss: there isn't one speed. A model has two very different jobs in a single request, and each has its own tokens-per-second number. First it has to read your prompt — every word of the question, the system instructions, and any pasted documents. Then it has to write the answer. These two phases run at wildly different speeds and are bottlenecked by completely different things. If you only ever look at one number, you'll be surprised when a long prompt suddenly feels slow even though the model "benchmarks fast."

Why it matters

If you're choosing a model or a piece of hardware to run it on, tokens per second is the single number that decides whether the experience is pleasant or unusable. A model can be smart, well-quantized, and free, and still be the wrong choice because it crawls on your machine. Speed is a feature.

It matters most because your comfortable speed depends entirely on what you're doing:

  • Interactive chat. You're reading along as the model types. The honest target is to beat your own reading speed: most adults read about 3 to 5 words per second, which is roughly 4 to 7 tok/s at three-quarters of a word per token, so anything above roughly 10 tok/s feels live, and 20+ feels effortless. Below ~5 tok/s, chatting becomes a chore.
  • Coding assistants and autocomplete. Here latency is brutal. You want the suggestion now, before you've typed the next character. Slow generation breaks your flow, so coding setups push for high tok/s and short prompts.
  • Bulk or batch jobs. Summarizing 10,000 documents overnight? You don't watch it. Total throughput matters, not per-token feel. A setup that's too slow to chat with can be perfectly fine for a job that runs while you sleep.
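To make the reading-speed comparison concrete, here is a minimal sketch that converts a reading speed into the equivalent tokens per second. It assumes the rough figure of three-quarters of a word per token used earlier; the words-per-minute inputs are illustrative, so substitute your own.

```python
def reading_speed_to_tok_per_s(words_per_minute: float,
                               words_per_token: float = 0.75) -> float:
    """Convert a reading speed (words/minute) to the matching tokens/second."""
    words_per_second = words_per_minute / 60.0
    return words_per_second / words_per_token


# Illustrative reading speeds, not measurements of any particular reader.
for wpm in (200, 250, 300):
    print(f"{wpm} words/min is about {reading_speed_to_tok_per_s(wpm):.1f} tok/s")
```

Any decode rate comfortably above the number you get here will feel like the text is keeping up with you.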

It also matters for budgeting hardware. The reason a bigger or differently-quantized model runs slower is usually memory bandwidth, not raw compute — and that single fact, explained below, predicts performance better than the GPU's advertised TFLOPS. Understanding tok/s is how you read a benchmark and know whether a given GPU, Mac, or laptop will actually be comfortable before you spend the money. For the hardware side, see local LLM hardware requirements.

How it works

A single request flows through the two phases, and each produces its own tokens-per-second number. Understanding what bottlenecks each one is the whole game.

Prefill: reading the prompt

In prefill, the model processes every token of your prompt at once. Because all the tokens are already known, the hardware can crunch them in parallel — a big matrix multiply that uses the GPU's compute units efficiently. Prefill is therefore compute-bound and usually fast: hundreds or even thousands of tokens per second. The cost of prefill scales with prompt length, so a giant prompt (a long document, a big system message) takes real time before any answer appears. That delay is time to first token (TTFT) — the pause you stare at after hitting enter.
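As a rough illustration of how prompt length turns into waiting time, the sketch below estimates time to first token by dividing prompt tokens by the prefill rate. It assumes a constant prefill rate and ignores model loading and other overheads, so treat the result as a lower bound rather than a prediction. The 500 tok/s figure is an assumed example value.

```python
def estimate_ttft_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Lower-bound estimate of time to first token, in seconds."""
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill_tok_per_s must be positive")
    return prompt_tokens / prefill_tok_per_s


PREFILL_RATE = 500.0  # assumed prefill speed in tok/s, for illustration only
for prompt_tokens in (256, 2_000, 10_000):
    ttft = estimate_ttft_seconds(prompt_tokens, PREFILL_RATE)
    print(f"{prompt_tokens:>6} prompt tokens: about {ttft:.2f} s before the first token")
```

At the assumed rate, a 10,000-token prompt already means about 20 seconds of silence before anything streams.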

Decode: writing the answer

Decode is different and slower. The model generates one token, then must feed that token back in to generate the next; it can't predict token 50 until it has produced token 49. The steps are sequential, so a single stream has no parallelism across tokens to exploit. For each and every token, the hardware has to read the model's entire set of weights out of memory. That's the bottleneck: not how fast the chip can multiply, but how fast it can move the weights from memory to the compute units. Decode is memory-bandwidth-bound.

This is why the headline spec to check before buying isn't the GPU's TFLOPS; it's the memory bandwidth (GB/s). It's also why an Apple Silicon Mac with unified memory can run big models respectably: lots of fast, wide memory. And it's why a model that barely fits in your VRAM runs fine, but the moment it spills into slow system RAM, decode falls off a cliff, because system RAM has a fraction of the bandwidth of GPU memory.
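The bandwidth argument can be turned into a back-of-the-envelope ceiling: if every generated token requires reading all the weights once, decode speed cannot exceed memory bandwidth divided by model size. The sketch below is that simplification only; the bandwidth and size figures are hypothetical examples, not the specs of any particular device, and real decode speeds land below the ceiling.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if all weights are read once per token."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 4.0  # hypothetical quantized model size
# Hypothetical memory systems, from slow system RAM to fast GPU memory.
for label, bandwidth in (("slow memory", 50.0), ("mid-range", 400.0), ("fast", 1000.0)):
    ceiling = decode_ceiling_tok_per_s(bandwidth, MODEL_SIZE_GB)
    print(f"{label:>11}: {bandwidth:>6.0f} GB/s gives at most {ceiling:.1f} tok/s")
```

Comparing this ceiling with a measured eval rate is a quick sanity check: a result far below the ceiling suggests something other than bandwidth, such as offloading, is holding the setup back.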

Reading the timing output

You don't have to guess these numbers — llama.cpp prints them after every generation, and Ollama exposes them too. Once you know which line is which, you can diagnose a slow setup in seconds. (For background on the engines, see what is llama.cpp and what is Ollama.)

Here's a typical llama.cpp footer, lightly trimmed:

llama.cpp timing output (text):

    prompt eval time =    512.30 ms /   256 tokens (   2.00 ms per token,  499.71 tokens per second)
           eval time =   4810.55 ms /   200 tokens (  24.05 ms per token,   41.58 tokens per second)
          total time =   5322.85 ms /   456 tokens

  • prompt eval = prefill. Here, 256 prompt tokens read at ~500 tok/s — fast, because it's parallel and compute-bound.
  • eval = decode. Here, 200 tokens generated at ~42 tok/s — this is the speed you feel as the answer streams. This is the number to quote.
  • ms per token is just the reciprocal of tokens per second: 1000 ÷ 24.05 ms/token ≈ 41.6 tok/s. Lower ms is better.
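If you collect many of these footers, recomputing the rates by hand gets tedious. The following is a small sketch that pulls the prefill and decode rates out of a footer shaped like the one above. The regular expression is an assumption based on that sample; different llama.cpp versions add a prefix to each line or say "runs" instead of "tokens", which the pattern tolerates, but other variations may need adjusting.

```python
import re

# Matches the "prompt eval time" and "eval time" lines of the sample footer.
TIMING_RE = re.compile(
    r"(prompt eval|eval) time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*(?:tokens|runs)"
)


def parse_llama_timings(footer: str) -> dict:
    """Return {'prefill': tok/s, 'decode': tok/s} for the lines that are present."""
    rates = {}
    for line in footer.splitlines():
        match = TIMING_RE.search(line)
        if match is None:
            continue  # e.g. the "total time" line
        phase = "prefill" if match.group(1) == "prompt eval" else "decode"
        elapsed_ms = float(match.group(2))
        tokens = int(match.group(3))
        rates[phase] = tokens / (elapsed_ms / 1000.0)
    return rates


footer = """\
prompt eval time =    512.30 ms /   256 tokens (   2.00 ms per token,  499.71 tokens per second)
       eval time =   4810.55 ms /   200 tokens (  24.05 ms per token,   41.58 tokens per second)
      total time =   5322.85 ms /   456 tokens
"""
rates = parse_llama_timings(footer)
print(f"prefill: {rates['prefill']:.2f} tok/s, decode: {rates['decode']:.2f} tok/s")
```

Recomputing from the raw milliseconds and token counts, rather than trusting the printed rate, also doubles as a check that you are reading the right line.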

Ollama gives you the same breakdown if you ask for it. Run with verbose mode and read prompt eval rate (prefill) and eval rate (decode):

See Ollama's speed numbers (bash):

    ollama run llama3.2 --verbose
    # after a reply, Ollama prints:
    #   prompt eval rate:   480.12 tokens/s   <- prefill
    #   eval rate:           38.94 tokens/s   <- decode (the one to watch)
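The same numbers are available programmatically. Ollama's HTTP API returns token counts and durations (in nanoseconds) with a non-streaming generate response, so the two rates can be computed directly. The sketch below assumes a local server on the default port; the helper names and the sample response are made up for illustration, and the network call is defined but not executed here.

```python
import json
import urllib.request


def rates_from_ollama_response(resp: dict) -> dict:
    """Compute prefill and decode tok/s from an Ollama generate response."""
    def rate(count_key: str, duration_key: str):
        count = resp.get(count_key)
        duration_ns = resp.get(duration_key)
        if not count or not duration_ns:
            return None  # field missing, e.g. when the prompt was served from cache
        return count / (duration_ns / 1e9)

    return {
        "prefill": rate("prompt_eval_count", "prompt_eval_duration"),
        "decode": rate("eval_count", "eval_duration"),
    }


def measure(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a local Ollama server and return the rates."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return rates_from_ollama_response(json.load(response))


# Made-up response fields, only to show the arithmetic.
sample = {
    "prompt_eval_count": 256, "prompt_eval_duration": 512_300_000,
    "eval_count": 200, "eval_duration": 4_810_550_000,
}
print(rates_from_ollama_response(sample))
```

With a server running, something like `measure("llama3.2", "Explain tokens per second.")` would return the same two rates for a live request.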

What counts as 'fast enough'?

There's no universal pass mark — it depends on the task. The table below is a practical feel-guide for decode speed (the generation number). Treat it as rough orientation, not a benchmark.

| Decode speed | How it feels | Good for |
| --- | --- | --- |
| < 3 tok/s | Painful. Slower than reading; you wait on every line. | Patience-testing only; offline batch if you must. |
| 3–8 tok/s | Usable but sluggish. Roughly matches slow reading. | Casual chat, single-shot questions, overnight jobs. |
| 10–20 tok/s | Comfortable. Keeps pace with or beats your reading. | Everyday chat, writing help, Q&A. |
| 20–50 tok/s | Snappy. Text outruns your eyes. | Interactive chat, light coding assistance. |
| 50+ tok/s | Instant-feeling. Latency disappears. | Coding autocomplete, agents, high-volume work. |

Two honest caveats. First, prefill speed matters separately for long prompts: a 30 tok/s decode setup still feels slow to start if you paste a 10,000-token document and prefill is weak — that's a long TTFT, not a slow stream. Second, throughput ≠ feel for batch work. If you're processing many requests at once, total tokens-per-second across the whole batch can be far higher than the per-request number, because the hardware fills idle time. A server doing 8 requests in parallel may post a big aggregate tok/s while each individual stream looks modest.
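The gap between aggregate and per-stream speed is simple arithmetic, sketched below. It assumes every parallel stream decodes at the same rate and counts only generated tokens, ignoring prefill time for the documents themselves. The stream count, rates, and job size are illustrative values, not measurements.

```python
def aggregate_tok_per_s(per_stream_tok_per_s: float, parallel_streams: int) -> float:
    """Total throughput when several equal-speed streams run at once."""
    return per_stream_tok_per_s * parallel_streams


def batch_job_hours(documents: int, output_tokens_per_document: int,
                    total_tok_per_s: float) -> float:
    """Hours needed to generate all output tokens at a given total throughput."""
    total_tokens = documents * output_tokens_per_document
    return total_tokens / total_tok_per_s / 3600.0


# Illustrative job: 10,000 summaries of about 300 generated tokens each.
batched = aggregate_tok_per_s(per_stream_tok_per_s=15.0, parallel_streams=8)
print(f"8 streams at 15 tok/s each: {batched:.0f} tok/s aggregate")
print(f"batched job: {batch_job_hours(10_000, 300, batched):.1f} h")
print(f"one stream at 30 tok/s: {batch_job_hours(10_000, 300, 30.0):.1f} h")
```

In this made-up case each batched stream is only half as fast as the single stream, yet the whole job finishes roughly four times sooner.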

Going deeper

Once the basics click, a few nuances explain most of the surprises people hit when they start measuring seriously.

Decode slows as context grows. As the conversation gets longer, the model must attend to a growing KV cache (the stored keys and values for every previous token). That cache eats memory bandwidth too, so decode speed gradually drops over a long chat. A model posting 40 tok/s at the start of a session may be at 30 tok/s ten thousand tokens in. Benchmarks taken on a short prompt flatter real long-conversation speed.
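A rough way to see this effect is to add the KV cache to the bytes read per token. The sketch below uses the standard size estimate (keys plus values, per layer, per KV head) and the same bandwidth-ceiling simplification as before. The model dimensions, weight size, and bandwidth are hypothetical, chosen only to make the trend visible.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


def decode_ceiling(bandwidth_gb_s: float, weights_gb: float, kv_gb: float) -> float:
    """Bandwidth ceiling when both weights and KV cache are read per token."""
    return bandwidth_gb_s / (weights_gb + kv_gb)


# Hypothetical model and hardware, with 16-bit cache values.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
WEIGHTS_GB, BANDWIDTH_GB_S = 4.0, 200.0

for context in (0, 2_000, 10_000):
    kv_gb = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM) / 1e9
    ceiling = decode_ceiling(BANDWIDTH_GB_S, WEIGHTS_GB, kv_gb)
    print(f"{context:>6} tokens of context: KV cache {kv_gb:.2f} GB, "
          f"ceiling {ceiling:.1f} tok/s")
```

The ceiling falls steadily as the cache grows, which is the same shape as the slowdown people notice in long chats.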

Quantization buys speed and memory, not magic. Because decode is bandwidth-bound, shrinking the model with quantization speeds up generation roughly in proportion to how much smaller it gets, and it lets bigger models fit in VRAM in the first place. The trade is a quality cost that is small at moderate levels and grows at aggressive ones. This is why the GGUF format ships a model in many quant sizes; you pick the largest one that fits and still hits your target tok/s.
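The "largest quant that fits and still hits your target" rule can be sketched as a small selection loop. This version assumes model size is simply parameters times bits per weight, uses the bandwidth ceiling as a stand-in for real decode speed, and ignores the memory needed for the KV cache. The candidate bit widths are nominal round numbers; real GGUF quants have fractional bits per weight and some overhead.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a given quantization width."""
    return params_billions * bits_per_weight / 8.0


def pick_quant(params_billions: float, vram_gb: float, bandwidth_gb_s: float,
               target_tok_per_s: float, candidate_bits=(16, 8, 5, 4)):
    """Return the widest candidate that fits and meets the target, else None."""
    for bits in sorted(candidate_bits, reverse=True):
        size = model_size_gb(params_billions, bits)
        fits = size <= vram_gb
        fast_enough = bandwidth_gb_s / size >= target_tok_per_s
        if fits and fast_enough:
            return bits
    return None


# Hypothetical 8B-parameter model on a 12 GB, 400 GB/s device.
print(pick_quant(8, vram_gb=12, bandwidth_gb_s=400, target_tok_per_s=40))
print(pick_quant(8, vram_gb=12, bandwidth_gb_s=400, target_tok_per_s=60))
```

Raising the speed target pushes the choice toward a narrower quant, which is the speed-for-quality trade in miniature.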

Batching boosts throughput, not single-stream speed. Serving engines built for many users (like vLLM) process multiple requests together, which raises aggregate tokens per second dramatically because the weights, read once from memory, serve several sequences at the same time. For a single local user typing one message, batching does nothing — your decode speed is what it is.

Watch out for thermal throttling and offloading. A laptop that benchmarks at 25 tok/s cold can sag after a few minutes as it heats up and the chip downclocks. And if part of the model is offloaded to CPU/system RAM because it doesn't fit in VRAM, the slow-memory portion drags the whole decode rate down — sometimes by an order of magnitude. When a number looks mysteriously bad, check whether the model fully fits in fast memory first.
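To see why partial offloading hurts so much, the per-token time can be modeled as the time to read the fast-memory share plus the time to read the slow-memory share. This is a deliberately simple sketch that ignores compute and transfer overheads; the model size and both bandwidth figures are hypothetical.

```python
def split_decode_tok_per_s(model_gb: float, offloaded_fraction: float,
                           fast_gb_s: float, slow_gb_s: float) -> float:
    """Bandwidth-only decode estimate when part of the model sits in slow memory."""
    if not 0.0 <= offloaded_fraction <= 1.0:
        raise ValueError("offloaded_fraction must be between 0 and 1")
    fast_seconds = model_gb * (1.0 - offloaded_fraction) / fast_gb_s
    slow_seconds = model_gb * offloaded_fraction / slow_gb_s
    return 1.0 / (fast_seconds + slow_seconds)


# Hypothetical 8 GB model, 400 GB/s fast memory, 40 GB/s system RAM.
for fraction in (0.0, 0.25, 0.5, 1.0):
    speed = split_decode_tok_per_s(8.0, fraction, fast_gb_s=400.0, slow_gb_s=40.0)
    print(f"{fraction:>4.0%} offloaded: about {speed:.1f} tok/s")
```

Even a quarter of the model in slow memory dominates the per-token time in this model, which is why checking the fit comes first when a number looks wrong.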

The durable mental model: prefill is compute-bound and fast, decode is memory-bandwidth-bound and is the number you feel. Quote the decode (eval) rate, check memory bandwidth before you buy hardware, keep the whole model in fast memory, and pick a quant size that hits the tok/s your task actually needs — chat, coding, or batch.

FAQ

What is a good tokens per second for a local LLM?

For interactive chat, aim for at least 10 tok/s of generation (decode) speed; 20+ feels effortless because it outruns your reading. Below ~5 tok/s chatting gets tedious. For overnight batch jobs you don't watch, even 3–5 tok/s can be fine — "good" depends entirely on the task.

What is the difference between prompt eval and generation speed?

Prompt eval (prefill) is how fast the model reads your prompt — it processes all prompt tokens in parallel, so it's compute-bound and usually fast (hundreds of tok/s). Generation (decode) is how fast it writes the answer one token at a time, which is sequential and limited by memory bandwidth, so it's slower. The generation number is the one you feel as the text streams out.

Why is my LLM slow even though I have a powerful GPU?

Generation speed is capped by memory bandwidth, not raw compute, so a GPU with huge TFLOPS but modest bandwidth can still feel slow. The other common cause is the model not fitting fully in VRAM: once part of it spills into slower system RAM (offloading), decode speed drops sharply. Check bandwidth and whether the whole model fits in fast memory.

How do I measure tokens per second in Ollama?

Run the model with the verbose flag, for example ollama run llama3.2 --verbose, and after each reply Ollama prints prompt eval rate (prefill speed) and eval rate (decode speed). The eval rate is the generation tokens-per-second you should quote. Generate at least 100 tokens and run it twice, since the first run includes a one-time model load and warm-up that can skew the timings.

Does quantization make a model faster?

Yes, usually. Because generation is limited by how fast the model's weights move through memory, shrinking the model with quantization speeds up decode roughly in proportion to the size reduction, and lets bigger models fit in fast VRAM. The cost is some loss of output quality, usually minor at moderate levels and more noticeable at aggressive ones.

Why does my local LLM slow down during a long conversation?

As context grows, the model stores keys and values for every previous token (the KV cache), and reading that growing cache consumes memory bandwidth on every new token. Since decode is bandwidth-bound, tokens per second gradually falls as the conversation gets longer, even on the same hardware and model.

Further reading