In plain English
An LLM writes text one token at a time, and at every step it has a ranked list of candidates. For the prompt "The sky over the desert was..." it might think blue is very likely, grey somewhat likely, vast plausible, and banana absurd but not impossible. Temperature is the dial that decides how seriously the model takes that ranking.
Think of ordering at your favorite restaurant. Temperature 0 is the regular who orders the exact same dish every visit — the safest, most-proven choice, no exceptions. Temperature around 0.7 is someone who usually orders a favorite but sometimes tries the special. Temperature 1.5 is the friend who closes their eyes and points at the menu. Same restaurant, same menu, very different dinners.
That's the whole idea. Low temperature means the model almost always picks its top-ranked token, so output is focused and repeatable. High temperature means lower-ranked tokens get a real shot, so output is varied, creative — and more likely to go somewhere weird. Temperature doesn't make the model smarter or dumber. It changes how the model chooses from what it already believes, not what it believes.
Why it matters
Temperature is the first knob anyone touching an LLM API should understand, because the wrong setting fails silently. Nothing errors. The model just behaves slightly wrong in ways that are easy to misdiagnose as a bad prompt or a bad model.
Set it too high for a structured task and you get flaky output: JSON that parses nine times out of ten, SQL with an invented column name on run number twelve, an agent that suddenly improvises a step you never asked for. Set it too low for a creative task and you get the opposite failure: every product tagline sounds the same, every story opens with the same sentence, and regenerating gives you near-identical text because the model keeps picking the same safe tokens.
It also answers one of the most common beginner questions: why does the same prompt give different answers? By default, most chat models run at a temperature well above zero, so every response is a fresh set of weighted dice rolls. (Even at temperature 0 you may not get perfectly identical output — that rabbit hole has its own article on LLM determinism.)
Who should care: anyone calling an LLM API, building agents or pipelines, writing evals, or just wondering why ChatGPT phrased the same answer three different ways. Temperature is also the gateway to the rest of the sampling toolbox — once it clicks, top-p and top-k make sense in about two minutes.
How it works
Under the hood, an LLM is a next-token prediction machine. At each step it produces a raw score, called a logit, for every token in its vocabulary (typically tens of thousands to a few hundred thousand of them). Those raw scores are converted into probabilities by a function called softmax, and then one token is drawn at random according to those probabilities. Temperature slots in right before the softmax: every logit is divided by the temperature value.
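Written out, the step described above looks like this, where $z_i$ is the logit for token $i$, $T$ is the temperature, and $p_i$ is the resulting probability (the symbol names are chosen here for illustration, not taken from any particular library):

```latex
p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
```

At $T = 1$ this is the ordinary softmax; the sum in the denominator runs over the whole vocabulary.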
Dividing by a small number (T below 1) makes the gaps between scores bigger, so after softmax the top token hogs almost all the probability — the distribution gets sharper. Dividing by a large number (T above 1) shrinks the gaps, so probability spreads out across more tokens — the distribution gets flatter. At the extreme, T approaching 0 collapses to greedy decoding: always take the single highest-scored token. Here's what that does to our desert-sky example:
| Candidate token | T = 0.2 | T = 1.0 | T = 1.5 |
|---|---|---|---|
| blue | ~98% | ~62% | ~52% |
| grey | ~2% | ~28% | ~31% |
| vast | under 0.1% | ~9% | ~15% |
| banana | effectively 0% | ~0.7% | ~3% |
Notice what high temperature really does: banana goes from effectively impossible to about a 1-in-40 chance (roughly 2.6%, which the table rounds to ~3%). Compound that small risk across hundreds of tokens in a long response and a derailment somewhere becomes likely. Notice also what it doesn't do: the ranking never changes. blue is the favorite at every temperature. Temperature redistributes confidence; it never adds knowledge the model doesn't have.
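Both observations can be checked with a few lines of arithmetic. The sketch below uses illustrative logits for the four example tokens (assumed values, chosen so the output roughly reproduces the table) and makes one loud simplification: it pretends every generation step carries the same long-shot risk, which real text does not.

```python
import math

# Assumed, illustrative logits for the four example tokens.
LOGITS = {"blue": 5.0, "grey": 4.2, "vast": 3.1, "banana": 0.5}


def softmax_with_temperature(logits, t):
    """Divide each logit by t, then normalize with softmax."""
    scaled = {tok: s / t for tok, s in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    return {tok: math.exp(s) / total for tok, s in scaled.items()}


def ranking(probs):
    """Tokens ordered from most to least probable."""
    return sorted(probs, key=probs.get, reverse=True)


def chance_of_at_least_one(p_step, n_steps):
    """Chance that an event with per-step probability p_step happens
    at least once across n_steps independent steps."""
    return 1.0 - (1.0 - p_step) ** n_steps


# The ranking is identical at every temperature.
rankings = [ranking(softmax_with_temperature(LOGITS, t)) for t in (0.2, 1.0, 1.5)]
print(rankings[0] == rankings[1] == rankings[2])

# Simplifying assumption: the same long-shot risk on every single step.
p_banana = softmax_with_temperature(LOGITS, 1.5)["banana"]
for n in (10, 100, 300):
    chance = chance_of_at_least_one(p_banana, n)
    print(f"{n:>3} tokens: {chance:.0%} chance of at least one long shot")
```

In a real model the per-step risk changes with context, so treat the compounding numbers as an intuition pump rather than a prediction.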
What should you set it to?
There is no universally correct value — only a correct value for the task. The question to ask: does this task have one right answer, or many good answers? One right answer wants low temperature. Many good answers wants higher.
| Use case | Temperature | Why |
|---|---|---|
| Code, SQL, JSON, tool calls | 0 – 0.3 | There's one right answer. You want the top pick, every run. |
| Extraction, classification, factual Q&A | 0 – 0.3 | Consistency beats flair. Repeated runs should agree. |
| Chat, summaries, everyday writing | 0.5 – 0.8 | Natural variety without going off the rails. |
| Brainstorming, fiction, taglines | 0.9 – 1.2 | You want surprises. Generate many, keep the best. |
| Deliberate weirdness | 1.3+ | Output degrades into word salad fast. Fun, rarely useful. |
Low temperature:
- Top token nearly always wins
- Repeatable, focused output
- Best for code and facts
- Risk: bland, repetitive text

Medium temperature:
- Mix of safe and fresh picks
- The default zone for chat
- Good for summaries and emails
- Risk: occasional odd phrasing

High temperature:
- Long shots get real chances
- Diverse, creative drafts
- Best for brainstorming
- Risk: derailment and nonsense
Two practical notes. First, providers cap the dial differently — some allow 0 to 1, others 0 to 2 — and their defaults differ too, so check what you're actually getting when you don't set it. Second, some reasoning-focused models lock their sampling settings entirely and will ignore or reject a temperature parameter, because their extended thinking process is tuned for a specific sampling setup.
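One way to make the setting deliberate is to keep the use-case table from this section in code and resolve a temperature explicitly for every request. The sketch below is hypothetical throughout: the task names, the `PRESETS` values (picked from inside the table's ranges), the `provider_max` argument, and the shape of the returned dictionary are all assumptions for illustration, not any provider's API.

```python
# Hypothetical presets, picked from inside the ranges in the use-case table.
PRESETS = {
    "code": 0.0,
    "extraction": 0.0,
    "chat": 0.7,
    "brainstorm": 1.1,
}


def resolve_temperature(task, provider_max=2.0):
    """Return an explicit temperature for a task, clamped to the provider's cap.

    provider_max is an assumption: check your provider's documented range.
    """
    if task not in PRESETS:
        raise ValueError(f"unknown task {task!r}; add it to PRESETS on purpose")
    return max(0.0, min(PRESETS[task], provider_max))


def build_request(task, prompt, provider_max=2.0):
    """Build a generic payload that always sets temperature.

    The field names are illustrative, not a real SDK's schema.
    """
    return {
        "prompt": prompt,
        "temperature": resolve_temperature(task, provider_max),
    }


print(build_request("code", "Write a SQL query that counts orders per day."))
# A provider capped at 1.0 pulls the brainstorm preset down to the cap.
print(build_request("brainstorm", "Ten taglines for a hiking boot.", provider_max=1.0))
```

Raising an error for unknown tasks, rather than falling back to a default, keeps an unset temperature from slipping through silently.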
See the math in code
The entire mechanism is about five lines of Python. No API key, no GPU — this is literally the math that runs inside every inference server, on a toy four-token vocabulary. Run it and watch the distribution sharpen and flatten:
```python
import math

# Raw scores (logits) a model might assign to four candidate
# next tokens for the prompt "The sky over the desert was ..."
logits = {"blue": 5.0, "grey": 4.2, "vast": 3.1, "banana": 0.5}


def apply_temperature(logits, t):
    scaled = {tok: s / t for tok, s in logits.items()}  # the temperature step
    total = sum(math.exp(s) for s in scaled.values())
    return {tok: math.exp(s) / total for tok, s in scaled.items()}  # softmax


for t in (0.2, 0.7, 1.0, 1.5):
    probs = apply_temperature(logits, t)
    row = " ".join(f"{tok} {p:6.1%}" for tok, p in probs.items())
    print(f"T={t}: {row}")
```

At T=0.2, blue swallows roughly 98% of the probability mass. At T=1.5, banana (a token the model scored as nearly nonsense) climbs to about 3%. In a real model this plays out over a vocabulary of tens of thousands of tokens or more, on every single generation step, which is why even small temperature changes compound noticeably over a long response.
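The loop above prints probabilities, but generation also involves the draw itself. As a small extension, the sketch below samples from the same toy distribution with Python's `random.choices` and counts what comes out. The fixed seed and the sample count of 1,000 are arbitrary choices for the demo.

```python
import math
import random
from collections import Counter

logits = {"blue": 5.0, "grey": 4.2, "vast": 3.1, "banana": 0.5}


def apply_temperature(logits, t):
    scaled = {tok: s / t for tok, s in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    return {tok: math.exp(s) / total for tok, s in scaled.items()}


def sample_tokens(logits, t, n, rng):
    """Draw n tokens at random, weighted by the temperature-scaled probabilities."""
    probs = apply_temperature(logits, t)
    return rng.choices(list(probs), weights=list(probs.values()), k=n)


rng = random.Random(0)  # fixed seed so repeated runs of this script match
for t in (0.2, 1.0, 1.5):
    counts = Counter(sample_tokens(logits, t, 1000, rng))
    print(f"T={t}: {dict(counts.most_common())}")
```

The counts should land close to the percentages computed earlier, with some wobble from the randomness of the draw.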
Common pitfalls
- Expecting temperature 0 to be perfectly deterministic. It usually means greedy decoding, but floating-point quirks, batching, and infrastructure details can still produce occasional variation. If you need to know why, read why the same prompt gives different answers.
- Using low temperature to "fix" hallucinations. If the model's top-ranked answer is wrong, temperature 0 just gives you that wrong answer consistently. Hallucination is a knowledge and training problem, not a dice problem — it has its own causes and fixes.
- Cranking temperature and top-p at the same time. They both widen or narrow the candidate pool, and their effects interact in confusing ways. Standard advice: tune one, leave the other at its default.
- Tuning temperature when the prompt is the real problem. If output quality is bad at every temperature, no dial setting will save you. Fix the instructions first; use temperature to control variance second.
- Forgetting that defaults vary. One SDK defaults to 1.0, another to 0.7, a local runtime to something else entirely. If you never set temperature explicitly, you don't actually know what your pipeline is doing — set it on purpose.
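On the first pitfall, it helps to see why temperature 0 is a special case at all: the scaling step divides by T, so T = 0 cannot go through the normal path and is handled as a plain argmax instead. The sketch below shows that idea on illustrative logits; the function name and the tie-breaking behaviour are assumptions, not a specific runtime's implementation.

```python
import math
import random


def pick_token(logits, t, rng):
    """Pick one token: argmax when t == 0, weighted random draw otherwise."""
    if t < 0:
        raise ValueError("temperature must be non-negative")
    if t == 0:
        # Greedy decoding: no division, no randomness.
        # Ties go to whichever token comes first in the dict.
        return max(logits, key=logits.get)
    top = max(logits.values())
    # Subtracting the top logit keeps exp() from overflowing at small t.
    weights = [math.exp((s - top) / t) for s in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


logits = {"blue": 5.0, "grey": 4.2, "vast": 3.1, "banana": 0.5}
rng = random.Random(0)
print([pick_token(logits, 0, rng) for _ in range(5)])    # always the top token
print([pick_token(logits, 1.5, rng) for _ in range(5)])  # depends on the seed
```

The greedy branch also illustrates the second pitfall: if the top-scored token is wrong, this code returns it on every call.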
Going deeper
Where the name comes from
The term is borrowed from statistical mechanics. The softmax-with-temperature formula is the Boltzmann distribution from physics, where temperature controls how much a system explores high-energy states: cold systems settle into the lowest-energy configuration, hot systems bounce around. Hinton, Vinyals, and Dean's 2015 knowledge-distillation paper cemented the term in deep learning — they raised the temperature on a large model's softmax to expose its "dark knowledge" (the relative probabilities of wrong answers) so a smaller student model could learn from it.
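The correspondence can be written out directly. In the sketch below, $E_i$ is the energy of state $i$, $k_B$ is Boltzmann's constant, and $z_i$ is a logit. Reading a logit as a negative energy, and absorbing $k_B$ into $T$, is the usual way to line the two formulas up; it is offered here as an interpretation, not as something the original papers are being quoted on.

```latex
p_i = \frac{\exp\!\left(-E_i / (k_B T)\right)}{\sum_j \exp\!\left(-E_j / (k_B T)\right)}
\qquad\longleftrightarrow\qquad
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
\quad \text{with } z_i = -E_i,\; k_B = 1
```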
Temperature is an entropy dial
Formally, temperature monotonically controls the entropy of the output distribution, that is, the average surprise per token. As T falls toward 0, entropy falls toward 0 (a deterministic pick); as T grows without bound, entropy rises toward that of the uniform distribution, where every token is equally likely. This framing explains a classic result from Holtzman et al.'s The Curious Case of Neural Text Degeneration: pure low-entropy decoding (greedy or beam search) makes language models loop and repeat, because the most-probable continuation of slightly repetitive text is more repetitive text. Human writing has surprisingly high per-token entropy; matching it requires some sampling randomness. That paper introduced nucleus (top-p) sampling as a remedy.
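The entropy claim is easy to check numerically. The sketch below computes Shannon entropy in bits for a four-token toy distribution at several temperatures; the logits are illustrative assumptions, and with four tokens the uniform ceiling is exactly 2 bits.

```python
import math

logits = {"blue": 5.0, "grey": 4.2, "vast": 3.1, "banana": 0.5}


def softmax_with_temperature(logits, t):
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp((s - top) / t) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


temperatures = (0.1, 0.5, 1.0, 2.0, 10.0, 100.0)
entropies = [entropy_bits(softmax_with_temperature(logits, t)) for t in temperatures]
for t, h in zip(temperatures, entropies):
    print(f"T={t:>5}: {h:.3f} bits")
```

The printed values should rise steadily from near 0 bits toward the 2-bit ceiling as T increases.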
Sampler order and modern variants
Temperature rarely works alone. In an inference stack the logits flow through a pipeline of samplers: temperature scaling plus truncation filters like top-k, top-p, or min-p (which drops tokens below a fraction of the top token's probability, and is popular in local-model runtimes because it adapts to how confident the model is). One common arrangement applies temperature first and truncation after, but the order is not universal, and it matters: scaling before truncation changes which tokens survive the cut. Runtimes like llama.cpp let you reorder the whole sampler chain. Hugging Face's transformers library implements each stage as a composable logits processor, which makes it a clean codebase to read if you want to see real implementations.
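To make the ordering point concrete, the sketch below applies temperature and top-p in both orders on a four-token toy vocabulary. Everything here is an illustrative assumption (the logits, the function names, p = 0.85), and it is plain Python rather than the code of llama.cpp or transformers.

```python
import math

logits = {"blue": 5.0, "grey": 4.2, "vast": 3.1, "banana": 0.5}


def softmax(logits, t=1.0):
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp((s - top) / t) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_survivors(probs, p):
    """Smallest set of top-ranked tokens whose probabilities sum to at least p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(tok)
        cumulative += prob
        if cumulative >= p:
            break
    return kept


def temperature_then_top_p(logits, t, p):
    # Truncate the temperature-scaled distribution, then renormalize.
    kept = top_p_survivors(softmax(logits, t), p)
    return softmax({tok: logits[tok] for tok in kept}, t)


def top_p_then_temperature(logits, t, p):
    # Truncate the unscaled (T = 1) distribution, then apply temperature.
    kept = top_p_survivors(softmax(logits, 1.0), p)
    return softmax({tok: logits[tok] for tok in kept}, t)


for name, fn in (("temperature first", temperature_then_top_p),
                 ("top-p first", top_p_then_temperature)):
    result = fn(logits, t=1.5, p=0.85)
    print(f"{name}: " + ", ".join(f"{tok} {prob:.1%}" for tok, prob in result.items()))
```

With these numbers, scaling first flattens the distribution enough that a third token is needed to reach the 0.85 threshold, while truncating first cuts the pool to two tokens before temperature is applied.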
Using randomness on purpose
Counterintuitively, temperature above 0 can make systems more accurate. Self-consistency decoding samples the same reasoning question multiple times at moderate temperature, then majority-votes the final answers — diverse reasoning paths that converge on one answer are strong evidence it's right. The same idea powers best-of-N sampling in code generation: sample ten candidate solutions at T around 0.8, run the tests, keep the one that passes. Randomness becomes a search strategy rather than a defect. And if you want to observe what temperature is doing to a production model rather than trust the theory, request logprobs — the per-token probabilities — and watch the distribution shift as you move the dial.
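The voting step is simple enough to sketch without a model. Below, `fake_model_answer` is a stand-in that returns the right answer 60% of the time and a random distractor otherwise. That function, its accuracy, and the answer strings are invented for the simulation; a real system would call an LLM at moderate temperature instead.

```python
import random
from collections import Counter


def fake_model_answer(rng, correct="42", distractors=("41", "43", "24"), accuracy=0.6):
    """Stand-in for one sampled reasoning path; NOT a real model call."""
    if rng.random() < accuracy:
        return correct
    return rng.choice(distractors)


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency(rng, n_samples):
    """Sample several answers and keep the one most of them agree on."""
    answers = [fake_model_answer(rng) for _ in range(n_samples)]
    return majority_vote(answers)


rng = random.Random(0)
trials = 2000
single = sum(fake_model_answer(rng) == "42" for _ in range(trials)) / trials
voted = sum(self_consistency(rng, 9) == "42" for _ in range(trials)) / trials
print(f"single sample accuracy: {single:.1%}")
print(f"9-sample majority vote: {voted:.1%}")
```

The gain depends on wrong answers being scattered across several distractors while right answers agree, which is the assumption self-consistency leans on.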
Open problems remain. Models are not perfectly calibrated — their token probabilities don't always match real-world frequencies, and post-training (RLHF in particular) tends to sharpen distributions, which is partly why instruction-tuned chat models feel samey at the same temperature where a base model feels wild. Dynamic-temperature schemes that adjust T per token based on the model's entropy are an active tinkering area in the open-source community. The dial is simple; deciding where it should point, automatically, still isn't.
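As one illustration of what a dynamic-temperature scheme can look like, the sketch below picks a temperature per step by interpolating between a minimum and a maximum according to how spread out the model's distribution is. The linear mapping, the bounds `t_min` and `t_max`, and the function names are assumptions made for this example; actual runtimes differ in the details.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(s - top) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Entropy divided by its maximum, so 0 = certain and 1 = uniform."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def dynamic_temperature(logits, t_min=0.2, t_max=1.2):
    """Hypothetical rule: confident steps stay near t_min,
    uncertain steps approach t_max."""
    return t_min + (t_max - t_min) * normalized_entropy(softmax(logits))


confident_step = [9.0, 1.0, 0.5, 0.2]  # one clear winner
uncertain_step = [2.0, 1.9, 1.8, 1.7]  # several near-equal options
print(round(dynamic_temperature(confident_step), 3))
print(round(dynamic_temperature(uncertain_step), 3))
```

The intent of a rule like this is to stay close to greedy where the model is sure and allow more variety where several continuations are about equally good.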
FAQ
Is temperature 0 deterministic?
Almost, but not guaranteed. Temperature 0 typically means greedy decoding — always pick the highest-probability token — but floating-point arithmetic, request batching, and other infrastructure details can still cause occasional run-to-run differences on hosted APIs. Treat it as "very repeatable," not "bit-identical."
What temperature should I use for coding and JSON output?
Low: 0 to 0.3. Code, SQL, and structured output have one correct answer, and any randomness only adds ways to break a parser or invent an identifier. If a machine consumes the output, pin the temperature near 0.
Can temperature go above 1, and what happens if it does?
Many APIs accept values up to 2. Above 1 the probability distribution gets flatter than the model's raw beliefs, so genuinely unlikely tokens start winning rolls. Around 1.3–1.5 most models drift into broken grammar and word salad. It's occasionally useful for maximum-diversity brainstorming, rarely for anything else.
Does lowering temperature reduce hallucinations?
Mostly no. Temperature changes how the model picks from what it already believes — it can't add missing knowledge. If the model's top-ranked completion is a confident fabrication, temperature 0 serves you that exact fabrication every time. Low temperature reduces variance, not wrongness.
What's the difference between temperature and top-p?
Temperature reshapes the whole probability distribution (sharper or flatter) but never removes any token. Top-p truncates it: only the smallest set of top-ranked tokens whose probabilities sum to at least p stays eligible, and everything else is cut to zero. They're usually used together, but you should tune one at a time.