In plain English
The softmax function is a small piece of math that takes a list of raw scores and turns them into a list of probabilities that add up to 1. Think of a panel of judges shouting numbers like 2.0, 1.0, and 0.1 for three contestants. Those numbers are not probabilities; they are just opinions on an arbitrary scale. Softmax is the scorekeeper who converts that messy chorus into a clean answer: contestant A wins about 66% of the vote, B gets about 24%, and C gets about 10%.
It does this in two moves: exponentiate every score (so bigger scores pull ahead and nothing stays negative), then normalize by dividing each one by the total. The result is always positive and always sums to 1, which is exactly what a probability distribution must do.
Why it matters
Most classifiers and language models do their thinking in a unitless number space. A neural network's final layer might say "cat = 4.2, dog = 1.1, bird = -0.5", but you cannot act on raw numbers like that. You cannot threshold them, compare them across examples, or feed them into a loss function that expects probabilities. Softmax is the universal adapter that makes those scores usable.
- Interpretability. 0.91 means the model is 91% confident, a far more useful statement than a raw score of 4.2.
- Training. The cross-entropy loss used to train classifiers needs a probability distribution as input. Softmax provides it.
- Sampling. In an LLM, the next token is drawn from the softmax probabilities, which is how the model picks its next word.
- Comparability. Because the outputs always sum to 1, you can compare confidence across different inputs and models on the same scale.
If you have ever wondered how a model goes from internal numbers to "the next word is probably 'the' with 31% likelihood", the answer is almost always softmax. It is one of the most-used functions in all of deep learning.
How it works
The softmax function formula in machine learning
Given a vector of scores z = [z_1, z_2, ..., z_K], softmax computes the output for position i as exp(z_i) divided by the sum of exp(z_j) over every position j. In words: raise e to the power of each score, then divide each result by the total of all of them. The denominator is the normalization term that guarantees the outputs sum to exactly 1.
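Written out in symbols, the sentence above corresponds to the following. The symbol σ is only an assumed shorthand for softmax here; the text itself does not name it.

```latex
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```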
Two properties fall straight out of this recipe. First, every output lands strictly between 0 and 1 (as long as there are at least two scores), because exp is always positive and we divide by a sum that includes the numerator. Second, the outputs sum to 1 by construction. Together they make a valid probability distribution. A third, subtler property is that softmax amplifies the largest score: the gap between 2.0 and 1.0 (a difference of 1.0 in raw space) becomes a gap between 0.659 and 0.242 after exponentiation and normalization, a ratio of e (about 2.72), because the exponential turns additive differences into multiplicative ones.
| Logit z_i | exp(z_i) | Probability |
|---|---|---|
| 2.0 | 7.39 | 0.659 |
| 1.0 | 2.72 | 0.242 |
| 0.1 | 1.11 | 0.099 |
Softmax in a few lines of code
Here is softmax written from scratch, plus the one-line library version most people actually use. Notice the - np.max(z) in the from-scratch version, which we explain in the next section.
import numpy as np

def softmax(z):
    z = z - np.max(z)           # numerical stability (more below)
    exp_z = np.exp(z)           # exponentiate every score
    return exp_z / exp_z.sum()  # normalize so outputs sum to 1

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # approx. [0.659 0.242 0.099]
print(probs.sum())  # 1.0 (up to floating-point rounding)

# In practice you'd reach for a library:
# from scipy.special import softmax
# from torch import softmax  # torch.softmax(t, dim=-1)

- Subtract the max for stability (this does not change the result).
- Exponentiate each element.
- Divide by the sum so the vector becomes a probability distribution.
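The same three steps carry over to a batch of score vectors. The sketch below is one possible way to do it in NumPy, assuming each row of a 2-D array is a separate set of logits; the name `softmax_rows` is made up for this illustration.

```python
import numpy as np

def softmax_rows(z, axis=-1):
    """Softmax along `axis`, e.g. one distribution per row of a batch."""
    z = np.asarray(z, dtype=float)
    # Subtract each row's own max; keepdims keeps the shapes broadcastable.
    shifted = z - z.max(axis=axis, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=axis, keepdims=True)

batch = np.array([[2.0, 1.0, 0.1],   # the example logits from above
                  [0.0, 0.0, 0.0]])  # equal scores give a uniform distribution
probs = softmax_rows(batch)
print(probs.round(3))
print(probs.sum(axis=-1))  # each row sums to 1 (up to rounding)
```

Normalizing along the last axis mirrors the `dim=-1` convention in the library call mentioned above.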
Softmax vs argmax vs sigmoid
Softmax is easy to confuse with two neighbours. Argmax also points at the biggest score, but it returns a single hard winner with no notion of confidence. It is non-differentiable, so you cannot train through it with gradient descent, which is why models use the soft version during training and often fall back to argmax only at inference when they just need one answer (the so-called greedy choice, covered in greedy decoding vs sampling).
Sigmoid squashes one number into the range (0, 1) independently. It is the right tool when classes are not mutually exclusive (an image can be both outdoor and sunny). Softmax is the right tool when exactly one class should win and the probabilities must compete and sum to 1. A useful mental note: softmax over two classes is mathematically equivalent to a single sigmoid applied to the difference between the two logits.
| Function | Output | Sums to 1? | Differentiable? | Use when |
|---|---|---|---|---|
| softmax | vector of probabilities | Yes | Yes | one class wins among many |
| argmax | index of the winner | n/a | No | you only need the final pick |
| sigmoid | one probability per class | No | Yes | classes are independent (multi-label) |
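A short numerical check of the comparison above may help. This is a sketch that reuses the cat/dog/bird scores from earlier; the helper names and the two-class logits are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([4.2, 1.1, -0.5])  # cat, dog, bird

probs = softmax(logits)               # competing probabilities, sum to 1
winner = int(np.argmax(logits))       # hard pick: index of the largest score
same_winner = int(np.argmax(probs))   # softmax keeps the ranking, so same index
independent = sigmoid(logits)         # one probability per class, need not sum to 1

# Two-class case: softmax equals a sigmoid of the logit difference.
two = np.array([1.5, -0.3])           # made-up logits
p_first = softmax(two)[0]
p_sigmoid = sigmoid(two[0] - two[1])

print(winner, same_winner)            # 0 0
print(independent.sum() > 1.0)        # True
print(np.isclose(p_first, p_sigmoid)) # True
```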
Where softmax lives inside an LLM
Large language models use softmax in two completely different places, which trips up a lot of newcomers.
1. The output layer (logits to next-token probabilities)
At the very end of the network, the model produces one logit per token in its vocabulary, often tens of thousands of them. Softmax turns that giant logit vector into a probability distribution over the whole vocabulary. The model then samples its next token from that distribution. This is the step that makes how LLMs work feel probabilistic rather than deterministic.
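To make the logits-to-token step concrete, here is a minimal sketch with a four-word toy vocabulary. The vocabulary and logits are invented for illustration; a real model produces one logit per entry of a far larger vocabulary.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

vocab = ["the", "a", "cat", "dog"]         # hypothetical toy vocabulary
logits = np.array([2.0, 1.0, 0.1, -1.0])   # made-up final-layer scores

probs = softmax(logits)                    # distribution over the vocabulary

rng = np.random.default_rng(seed=0)
sampled_id = int(rng.choice(len(vocab), p=probs))  # draw one token id at random
greedy_id = int(np.argmax(probs))                  # always the most likely token

print(vocab[greedy_id])    # the
print(vocab[sampled_id])   # varies with the random seed
```

Sampling can return any token with nonzero probability, while the greedy pick is always the same for the same logits.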
2. Inside attention
Deep inside every transformer layer, the attention mechanism scores how much each token should attend to every other token. Those scores are passed through softmax to become attention weights that sum to 1, so each token spends a fixed budget of attention across the sequence. The original Attention Is All You Need formula is softmax(QK^T / sqrt(d_k)) V, where the division by sqrt(d_k) keeps the scores from getting so large that softmax becomes overly peaked. See how attention works for the full picture.
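The attention formula can be sketched in a few lines of NumPy. This is a simplified single-head version with no masking or batching, using random matrices as stand-ins for Q, K, and V; it illustrates the formula rather than any production implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) raw scores
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8                     # arbitrary sizes for the demo
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))

out, weights = attention(Q, K, V)
print(out.shape)             # (4, 8)
print(weights.sum(axis=-1))  # each row sums to 1 (up to rounding)
```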
Going deeper
Temperature: making softmax sharper or softer
You can divide every logit by a temperature T before applying softmax: softmax(z / T). A low temperature (T < 1) makes the distribution sharper and more confident, pushing mass onto the top choice. A high temperature (T > 1) flattens it toward uniform, giving rarer options a chance. At T = 1 you get plain softmax; as T approaches 0 softmax approaches argmax. This single knob is exactly the temperature setting you adjust in an LLM API to trade off focused versus creative output.
| Temperature | Effect on distribution | Behaviour |
|---|---|---|
| T < 1 (e.g. 0.5) | sharper, peakier | more deterministic, repetitive |
| T = 1 | standard softmax | model's natural distribution |
| T > 1 (e.g. 1.5) | flatter, smoother | more diverse, more random |
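The effect in the table is easy to see numerically. Below is a small sketch using the example logits from earlier; the function name `softmax_with_temperature` is made up for this illustration.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    if T <= 0:
        raise ValueError("temperature must be positive")
    z = np.asarray(z, dtype=float) / T   # divide the logits by T first
    z = z - z.max()                      # then the usual stable softmax
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])

sharp = softmax_with_temperature(logits, T=0.5)
plain = softmax_with_temperature(logits, T=1.0)
flat = softmax_with_temperature(logits, T=1.5)

print(sharp.round(2))  # top probability approx. 0.86
print(plain.round(2))  # top probability approx. 0.66
print(flat.round(2))   # top probability approx. 0.56
```

The ranking of the three options never changes; only how much probability mass the top choice takes from the others.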
The numerical-stability max-subtraction trick
Computers store floating-point numbers with a finite range. Exponentiating a large logit like 1000 produces a number too big to represent, which becomes inf and poisons the whole computation (and very negative logits can underflow to 0). The fix relies on a property of softmax called translation invariance: subtracting the same constant c from every logit multiplies both the numerator and the denominator by exp(-c), which cancels, so the output is unchanged. The trick is to subtract the maximum logit from every element first. Now the largest exponent is exp(0) = 1 and every other term is between 0 and 1, so nothing can overflow, and the final probabilities are mathematically identical to the naive version.
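The claim that subtracting a constant leaves the probabilities unchanged can be checked in one line. Here c stands for any constant, and the trick chooses c to be the largest logit:

```latex
\frac{e^{z_i - c}}{\sum_{j=1}^{K} e^{z_j - c}}
  = \frac{e^{-c}\, e^{z_i}}{e^{-c} \sum_{j=1}^{K} e^{z_j}}
  = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}},
\qquad c = \max_j z_j
```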
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

# Naive: overflows to inf, returns nan
naive = np.exp(z) / np.exp(z).sum()
print(naive)   # [nan nan nan]

# Stable: subtract the max first, identical math, no overflow
z_shift = z - z.max()   # [-2., -1., 0.]
stable = np.exp(z_shift) / np.exp(z_shift).sum()
print(stable)  # approx. [0.090 0.245 0.665]

Cost at scale and what comes after
Over a vocabulary of 100,000+ tokens, computing a full softmax for every position of every training step is expensive, which has motivated approximations like hierarchical softmax and sampled softmax in older systems. Modern frameworks fuse the softmax with the cross-entropy loss (the log-softmax path) for both speed and stability. The takeaway: softmax is conceptually a two-line function, but squeezing it onto a GPU at frontier-model scale is a genuine engineering problem.
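The log-softmax path can be sketched as follows. This shows only the underlying math in NumPy, under the assumption of a single example with an integer class label; the fused kernels in real frameworks are implemented differently, and the function names here are made up.

```python
import numpy as np

def log_softmax(z):
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()                           # same max-subtraction trick
    return shifted - np.log(np.exp(shifted).sum())  # log of the softmax output

def cross_entropy_from_logits(z, target):
    """Negative log-probability of the correct class, computed from raw logits."""
    return -log_softmax(z)[target]

logits = np.array([1000.0, 1001.0, 1002.0])  # would overflow a naive softmax
loss = cross_entropy_from_logits(logits, target=2)
print(loss)  # approx. 0.408, i.e. -log(0.665)
```

Working in log space avoids ever forming a tiny probability and then taking its logarithm, which is where precision is otherwise lost.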
FAQ
What is the softmax function machine learning models use?
Softmax is a function that converts a vector of raw scores (logits) into a probability distribution. It exponentiates each score and divides by the total, so every output is positive and all outputs sum to 1. It is the standard bridge from a model's internal scores to interpretable probabilities.
What is the difference between softmax and argmax?
Argmax returns only the index of the single largest score with no confidence value and is not differentiable, so you cannot train through it. Softmax returns a full probability distribution over all options, is differentiable, and preserves the ranking. Models train with softmax and may use argmax at inference for a single hard pick.
When should I use softmax instead of sigmoid?
Use softmax when exactly one class should win and the probabilities must compete and sum to 1 (single-label classification, next-token prediction). Use sigmoid when classes are independent and more than one can be true at once (multi-label tasks). Softmax over two classes equals a single sigmoid applied to the difference between the two logits.
Why do you subtract the max in softmax?
Exponentiating a large logit can overflow to infinity and break the computation. Because softmax is unchanged when you add or subtract a constant from every input, subtracting the maximum logit makes the largest exponent exp(0) = 1 and prevents overflow, while producing mathematically identical probabilities.
How does temperature change softmax?
Temperature T divides the logits before softmax. A temperature below 1 sharpens the distribution and makes the model more confident and deterministic; a temperature above 1 flattens it and increases diversity. At T = 1 you get standard softmax, and as T approaches 0 softmax approaches argmax.