Softmax Function Machine Learning Guide: Turning Scores Into Probabilities

How softmax exponentiates and normalizes raw scores into a probability distribution that sums to 1.

INTERMEDIATE · 8 MIN READ · UPDATED 2026-06-12

In plain English

The softmax function is a small piece of math that takes a list of raw scores and turns them into a list of probabilities that add up to 1. Think of a panel of judges shouting numbers like 2.0, 1.0, and 0.1 for three contestants. Those numbers are not probabilities; they are just opinions on an arbitrary scale. Softmax is the scorekeeper who converts that messy chorus into a clean answer: contestant A wins 66% of the vote, B gets 24%, C gets 10%.

It does this in two moves: exponentiate every score (so bigger scores pull ahead and nothing stays negative), then normalize by dividing each one by the total. The result is always positive and always sums to 1, which is exactly what a probability distribution must do.

Why it matters

Most classifiers and language models do their thinking in a unitless number space. A neural network's final layer might say "cat = 4.2, dog = 1.1, bird = -0.5", but you cannot act on raw numbers like that. You cannot threshold them, compare them across examples, or feed them into a loss function that expects probabilities. Softmax is the universal adapter that makes those scores usable.

Interpretability. An output of 0.91 reads as "the model is 91% confident", a far more useful statement than a raw score of 4.2.
Training. The cross-entropy loss used to train classifiers needs a probability distribution as input. Softmax provides it.
Sampling. In an LLM, the next token is drawn from the softmax probabilities, which is how the model picks its next word.
Comparability. Because the outputs always sum to 1, you can compare confidence across different inputs and models on the same scale.

If you have ever wondered how a model goes from internal numbers to "the next word is probably 'the' with 31% likelihood", the answer is almost always softmax. It is one of the most-used functions in all of deep learning.

How it works

The softmax formula

Given a vector of scores z = [z_1, z_2, ..., z_K], softmax computes the output for position i as exp(z_i) divided by the sum of exp(z_j) over every position j. In words: raise e to the power of each score, then divide each result by the total of all of them. The denominator is the normalization term that guarantees the outputs sum to exactly 1.
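Written as an equation, the description above reads as follows. This is only a notational restatement; the symbol \sigma for softmax is a labeling choice made here, not one the text fixes.

```latex
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```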

// From raw scores to probabilities

Logits: [2.0, 1.0, 0.1]
Exponentiate (e^z): [7.39, 2.72, 1.11]
Sum them: 7.39 + 2.72 + 1.11 = 11.22
Divide each by the sum (normalize)
Probabilities: [0.659, 0.242, 0.099] -> sum = 1

Two properties fall straight out of this recipe. First, every output lands strictly between 0 and 1 (given at least two scores), because exp is always positive and each numerator is only one term of the sum it is divided by. Second, the outputs sum to 1 by construction. Together they make a valid probability distribution. A third, subtler property is that softmax amplifies the largest score: a logit gap of 1.0 (2.0 versus 1.0) becomes a probability ratio of e^1.0, about 2.72 (0.659 versus 0.242), because the exponential turns additive differences into multiplicative ones.

Logit z_i	exp(z_i)	Probability
2.0	7.39	0.659
1.0	2.72	0.242
0.1	1.11	0.099
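The three properties can be checked numerically. The sketch below is a minimal plain-Python version (standard library only) run on the same example logits; the variable names are illustrative, and this softmax omits the stability shift for brevity.

```python
import math

def softmax(scores):
    """Plain-Python softmax: exponentiate, then divide by the total."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# Property 1: every output lies strictly between 0 and 1.
all_in_range = all(0.0 < p < 1.0 for p in probs)

# Property 2: the outputs sum to 1 (up to floating-point rounding).
sums_to_one = math.isclose(sum(probs), 1.0)

# Property 3: a logit gap of d becomes a probability ratio of exp(d).
ratio = probs[0] / probs[1]                        # top two probabilities
expected_ratio = math.exp(logits[0] - logits[1])   # exp(1.0)

print(all_in_range, sums_to_one)
print(round(ratio, 3), round(expected_ratio, 3))
```

The two printed ratios agree because the shared denominator cancels when one probability is divided by another.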

Softmax in a few lines of code

Here is softmax written from scratch, plus the one-line library version most people actually use. Notice the - np.max(z) in the from-scratch version, which is explained under "The numerical-stability max-subtraction trick" below.

softmax.py (Python)

import numpy as np

def softmax(z):
    z = z - np.max(z)          # numerical stability (more below)
    exp_z = np.exp(z)          # exponentiate every score
    return exp_z / exp_z.sum() # normalize so outputs sum to 1

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)          # [0.659  0.242  0.099]
print(probs.sum())    # 1.0

# In practice you'd reach for a library:
# from scipy.special import softmax
# from torch import softmax  # torch.softmax(t, dim=-1)

Subtract the max for stability (this does not change the result).
Exponentiate each element.
Divide by the sum so the vector becomes a probability distribution.
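In practice logits usually arrive as a batch, one row per example, and the same three steps have to run per row rather than over the whole array. The sketch below is one way to do that in NumPy; the helper name softmax_rows and the example batch are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax_rows(z):
    """Softmax over the last axis of an array of logits."""
    z = np.asarray(z, dtype=float)
    # keepdims=True keeps the reduced axis so the subtraction broadcasts per row
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Hypothetical batch: two examples, three classes each
batch = np.array([[2.0, 1.0, 0.1],
                  [0.0, 0.0, 0.0]])
probs = softmax_rows(batch)

print(probs.sum(axis=-1))   # one sum per row
print(probs[1])             # equal logits give a uniform row
```

Library versions expose the same choice as an argument (axis in scipy.special.softmax, dim in torch.softmax), so it is worth passing it explicitly rather than relying on a default.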

Softmax vs argmax vs sigmoid

Softmax is easy to confuse with two neighbours. Argmax also points at the biggest score, but it returns a single hard winner with no notion of confidence. It is non-differentiable, so you cannot train through it with gradient descent, which is why models use the soft version during training and often fall back to argmax only at inference when they just need one answer (the so-called greedy choice, covered in greedy decoding vs sampling).

Sigmoid squashes one number into the range (0, 1) independently. It is the right tool when classes are not mutually exclusive (an image can be both outdoor and sunny). Softmax is the right tool when exactly one class should win and the probabilities must compete and sum to 1. A useful mental note: softmax over two classes is mathematically equivalent to a single sigmoid applied to the difference of the two logits.

Function	Output	Sums to 1?	Differentiable?	Use when
softmax	vector of probabilities	Yes	Yes	one class wins among many
argmax	index of the winner	n/a	No	you only need the final pick
sigmoid	one probability per class	No	Yes	classes are independent (multi-label)
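The two-class equivalence is easy to confirm numerically. This is a small sketch using only the standard library; the logits a and b are arbitrary example values.

```python
import math

def softmax(scores):
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

a, b = 1.7, -0.4                          # arbitrary example logits
p_softmax = softmax([a, b])[0]            # probability of the first class
p_sigmoid = sigmoid(a - b)                # sigmoid of the logit difference

print(math.isclose(p_softmax, p_sigmoid))
```

Only the difference a - b matters, which is why a binary classifier can get away with a single output logit.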

Where softmax lives inside an LLM

Large language models use softmax in two completely different places, which trips up a lot of newcomers.

1. The output layer (logits to next-token probabilities)

At the very end of the network, the model produces one logit per token in its vocabulary, often tens of thousands of them. Softmax turns that giant logit vector into a probability distribution over the whole vocabulary. The model then samples its next token from that distribution. This is the step that makes how LLMs work feel probabilistic rather than deterministic.
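As a rough illustration of this step, the sketch below turns a handful of logits into probabilities and then picks a token two ways: by sampling and by argmax. The four-word vocabulary and the logit values are made up for the example; a real model does the same thing over its full vocabulary.

```python
import math
import random

def softmax(scores):
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits (illustrative values only)
vocab = ["the", "a", "cat", "sat"]
logits = [3.1, 2.0, 0.5, -1.0]

probs = softmax(logits)

rng = random.Random(0)                    # seeded so runs are repeatable
sampled = rng.choices(vocab, weights=probs, k=1)[0]   # stochastic pick
greedy = vocab[probs.index(max(probs))]   # argmax pick, same every time

print(sampled, greedy)
```

The greedy pick never changes for a given set of logits, while the sampled pick varies with the random seed in proportion to the probabilities.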

2. Inside attention

Deep inside every transformer layer, the attention mechanism scores how much each token should attend to every other token. Those scores are passed through softmax to become attention weights that sum to 1, so each token spends a fixed budget of attention across the sequence. The original Attention Is All You Need formula is softmax(QK^T / sqrt(d_k)) V, where the division by sqrt(d_k) keeps the scores from getting so large that softmax becomes overly peaked. See how attention works for the full picture.
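The formula above can be sketched in a few lines of NumPy. This is a simplified single-head version with no masking and no batch dimension; the shapes (4 tokens, d_k = 8) and the random inputs are assumptions made purely for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - np.max(z, axis=axis, keepdims=True)   # stability shift
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) raw scores
    weights = softmax(scores, axis=-1)   # each row becomes a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, d_k = 8 (illustrative sizes)
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))   # one attention budget per token
print(output.shape)
```

Each row of weights is one token's attention budget, so the softmax runs along the key axis, not over the whole score matrix.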

// Two softmax jobs in a transformer

Softmax in an LLM:
Attention weights (inside every layer): who attends to whom
Output distribution (final layer): probability of each next token

Going deeper

Temperature: making softmax sharper or softer

You can divide every logit by a temperature T before applying softmax: softmax(z / T). A low temperature (T < 1) makes the distribution sharper and more confident, pushing mass onto the top choice. A high temperature (T > 1) flattens it toward uniform, giving rarer options a chance. At T = 1 you get plain softmax; as T approaches 0 the output approaches a one-hot vector on the largest logit, which is argmax in effect. This single knob is exactly the temperature setting you adjust in an LLM API to trade off focused versus creative output.

Temperature	Effect on distribution	Behaviour
T < 1 (e.g. 0.5)	sharper, peakier	more deterministic, repetitive
T = 1	standard softmax	model's natural distribution
T > 1 (e.g. 1.5)	flatter, smoother	more diverse, more random
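The effect in the table can be reproduced with a small sketch. The function name softmax_with_temperature is hypothetical, and the three temperatures are the example values from the table.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]   # divide logits by T first
    m = max(scaled)                              # stability shift
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 1.5)

for name, p in [("T=0.5", sharp), ("T=1.0", plain), ("T=1.5", flat)]:
    print(name, [round(x, 3) for x in p])
```

The probability of the top choice rises as T falls and shrinks as T grows, while the ranking of the options stays the same.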

The numerical-stability max-subtraction trick

Computers store floating-point numbers with a finite range. Exponentiating a large logit like 1000 produces a number too big to represent, which becomes inf and poisons the whole computation (and very negative logits can underflow to 0). The fix relies on a translation-invariance property: adding or subtracting the same constant from every logit leaves the softmax output unchanged, because the constant contributes the same factor to the numerator and the denominator and cancels. So subtract the maximum logit from every element first. Now the largest exponent is exp(0) = 1 and everything else is between 0 and 1, so nothing can overflow, and the final probabilities are mathematically identical to the naive version.

stable_vs_naive.py (Python)

import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

# Naive: overflows to inf, returns nan
naive = np.exp(z) / np.exp(z).sum()
print(naive)            # [nan nan nan]

# Stable: subtract the max first, identical math, no overflow
z_shift = z - z.max()   # [-2., -1., 0.]
stable = np.exp(z_shift) / np.exp(z_shift).sum()
print(stable)           # [0.090  0.245  0.665]
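The reason the shift is harmless can be written out in one line. For any constant c (here c is the maximum logit), the factor e^{-c} appears in both numerator and denominator and cancels:

```latex
\frac{e^{z_i - c}}{\sum_{j=1}^{K} e^{z_j - c}}
= \frac{e^{-c}\, e^{z_i}}{e^{-c} \sum_{j=1}^{K} e^{z_j}}
= \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
```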

Cost at scale and what comes after

Over a vocabulary of 100,000+ tokens, computing a full softmax for every position of every training step is expensive, which has motivated approximations like hierarchical softmax and sampled softmax in older systems. Modern frameworks fuse the softmax with the cross-entropy loss (the log-softmax path) for both speed and stability. The takeaway: softmax is conceptually a two-line function, but squeezing it onto a GPU at frontier-model scale is a genuine engineering problem.
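To give a feel for the log-softmax path, the sketch below computes cross-entropy directly from logits without ever forming the probabilities. It is a plain-Python illustration of the idea, not how any particular framework implements its fused kernel; the function names are hypothetical.

```python
import math

def log_softmax(scores):
    # log-sum-exp with the max pulled out, so exp never sees a large value
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum_exp for s in scores]

def cross_entropy(scores, target_index):
    # negative log-probability of the correct class
    return -log_softmax(scores)[target_index]

logits = [1000.0, 1001.0, 1002.0]   # would overflow a naive exp
loss = cross_entropy(logits, target_index=2)
print(round(loss, 4))
```

Working in log space avoids taking the log of a probability that may have underflowed to 0, which is the stability half of the argument for fusing the two operations.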

FAQ

What is the softmax function machine learning models use?

Softmax is a function that converts a vector of raw scores (logits) into a probability distribution. It exponentiates each score and divides by the total, so every output is positive and all outputs sum to 1. It is the standard bridge from a model's internal scores to interpretable probabilities.

What is the difference between softmax and argmax?

Argmax returns only the index of the single largest score with no confidence value and is not differentiable, so you cannot train through it. Softmax returns a full probability distribution over all options, is differentiable, and preserves the ranking. Models train with softmax and may use argmax at inference for a single hard pick.

When should I use softmax instead of sigmoid?

Use softmax when exactly one class should win and the probabilities must compete and sum to 1 (single-label classification, next-token prediction). Use sigmoid when classes are independent and more than one can be true at once (multi-label tasks). Softmax over two classes equals a single sigmoid.

Why do you subtract the max in softmax?

Exponentiating a large logit can overflow to infinity and break the computation. Because softmax is unchanged when you add or subtract a constant from every input, subtracting the maximum logit makes the largest exponent exp(0) = 1 and prevents overflow, while producing mathematically identical probabilities.

How does temperature change softmax?

Temperature T divides the logits before softmax. A temperature below 1 sharpens the distribution and makes the model more confident and deterministic; a temperature above 1 flattens it and increases diversity. At T = 1 you get standard softmax, and as T approaches 0 softmax approaches argmax.
