How Is LLM Bias Measured? Fairness for Builders

In plain English

Bias in an LLM is a systematic difference in how the model treats people or topics based on attributes like race, gender, nationality, religion, or age. It is not a single bug you can point to in the code — it is a statistical pattern baked in from training data and reinforced by fine-tuning choices. The model does not "intend" to treat groups differently; it has simply learned associations that reflect, and often amplify, the imbalances that already existed in the text it was trained on.

A useful analogy: imagine a hiring manager who has read ten thousand job applications, mostly from one demographic group. Even without any explicit prejudice, their intuitions about what a "strong" application looks like will be calibrated to that group. Ask them to rate a novel application and they will score it against a mental template shaped by that history. An LLM trained on internet text has the same problem at massive scale — its "intuitions" about doctors, criminals, family roles, and countless other categories are shaped by who wrote what on the web.

Why it matters for builders

Bias is not just an ethics problem — it is a product quality problem. A job-screening assistant that ranks resumes differently based on whether the candidate has a typically-male or typically-female name will produce worse business outcomes. A medical Q&A tool that gives vaguer answers to questions framed from a lower-income perspective may contribute to health disparities. A code assistant that generates more cautious or less complete suggestions for prompts containing certain names will frustrate those users and erode their trust.

Why this lands specifically on builders

You inherit the model's biases. Using a foundation model through an API means you also inherit whatever bias patterns exist in that model, even if you never read the training data.
Fine-tuning can amplify bias. If your fine-tuning dataset is not carefully balanced, you can make the base model's biases worse, not better.
Regulators are paying attention. The EU AI Act and emerging US rules treat high-risk AI systems (hiring, credit, healthcare, education) as requiring bias documentation and audit trails.
Bias hurts the users most likely to already be underserved. The people most affected by biased AI outputs are often members of groups that were already under-represented in the training data — so the harm is not randomly distributed.
Post-launch fixes are costly. Retrofitting fairness after deployment is far harder than catching bias during evaluation. A structured audit before launch is cheaper.

How bias measurement works

Bias measurement sits at three levels in an LLM: the embedding layer (how the model represents words and concepts internally), the probability layer (how the model scores different token completions), and the output layer (what text the model actually generates). Each level has its own tools and limitations. Research has found that bias detected at the embedding or probability level does not always predict bias in actual generated text — so production bias audits focus primarily on outputs.

Three levels where bias can be measured

Output text: what the model actually says (benchmark QA, open-ended generation, toxicity scores)
Token probabilities: the probability the model assigns to stereotype-consistent vs. anti-stereotype completions
Embeddings: the Word Embedding Association Test (WEAT), which measures the distance between group concepts and attribute concepts in vector space
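
To make the embedding level concrete, here is a minimal sketch of a WEAT-style effect size: how much more strongly one set of target words associates with attribute set A than with attribute set B, compared with a second set of target words. The 2-D vectors are invented toy values, not real embeddings, and the function names are illustrative; a real test would load vectors from the model under study.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, attrs_a, attrs_b):
    """How much closer w is to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)

def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference in mean association between the target sets, in standard deviations."""
    assoc_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    assoc_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(assoc_x) - mean(assoc_y)) / stdev(assoc_x + assoc_y)

# Toy 2-D vectors, invented for illustration only.
targets_x = [(1.0, 0.1), (0.9, 0.2)]  # stand-ins for terms describing group X
targets_y = [(0.1, 1.0), (0.2, 0.9)]  # stand-ins for terms describing group Y
attrs_a = [(1.0, 0.0), (0.8, 0.1)]    # stand-ins for one attribute set (e.g. career words)
attrs_b = [(0.0, 1.0), (0.1, 0.8)]    # stand-ins for the other set (e.g. family words)

effect = weat_effect_size(targets_x, targets_y, attrs_a, attrs_b)
print(f"effect size = {effect:.2f}")
```

A positive value means the X targets sit closer to attribute set A than the Y targets do; swapping the two target sets flips the sign.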

Benchmark-based measurement

The most standardized approach is to run the model on a curated benchmark dataset where the correct answers are known and the test items are carefully designed to probe specific demographic groups. The three most widely used benchmarks are described in the next section. Benchmark scores are easy to compare across models and repeatable, but they only measure what the benchmark covers — a model can ace every benchmark and still exhibit bias on your specific task.

Counterfactual probing

Counterfactual fairness testing works by changing one demographic attribute in a prompt while holding everything else constant, then measuring whether the model's output changes. If the output differs — in tone, length, certainty, or content — based solely on that attribute, the model is exhibiting a bias. This is sometimes called a swap test. Attributes that get swapped in practice include names (e.g., "Jamal" vs. "Greg"), pronouns (he/she/they), occupation titles, and geographic references.

# Minimal counterfactual probe: the prompts differ only in the name.
# `llm` is a placeholder for your model client; complete(prompt) is assumed
# to return the generated text as a string.
names = ["Emily", "Mohammed", "Wei"]
template = "Write a performance review for {name}, a software engineer."

for name in names:
    response = llm.complete(template.format(name=name))
    # Compare tone, length, and specificity across names.
    print(f"--- {name} ---")
    print(response)

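If the responses above differ substantially in their assessments, positivity, or specificity, and the difference persists across repeated samples, it is attributable to the name, which is the only thing that changed. A single pair of completions can differ through sampling randomness alone, so compare several generations per prompt before drawing conclusions. A bias-free model should produce responses that are statistically indistinguishable across groups when all other factors are equal.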
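
One way to move from eyeballing to numbers is to sample several responses per name and compare simple statistics. The sketch below is illustrative only: `fake_generate` is a stand-in for a real model call so the example runs offline, and the positive-word lexicon is a tiny invented list, not a validated sentiment measure.

```python
from statistics import mean

POSITIVE_WORDS = {"excellent", "strong", "reliable", "outstanding", "thoughtful"}

def positive_word_rate(text):
    """Share of words that appear in the (tiny, illustrative) positive lexicon."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in POSITIVE_WORDS for w in words) / len(words)

def summarize_by_name(generate, template, names, samples_per_name=5):
    """Call generate(prompt) several times per name and average simple statistics."""
    summary = {}
    for name in names:
        prompt = template.format(name=name)
        outputs = [generate(prompt) for _ in range(samples_per_name)]
        summary[name] = {
            "mean_words": mean(len(o.split()) for o in outputs),
            "mean_positive_rate": mean(positive_word_rate(o) for o in outputs),
        }
    return summary

def fake_generate(prompt):
    """Hypothetical stand-in for a model call; replace with your own client."""
    return "A reliable engineer with strong, thoughtful code reviews."

template = "Write a performance review for {name}, a software engineer."
summary = summarize_by_name(fake_generate, template, ["Emily", "Mohammed", "Wei"])
for name, stats in summary.items():
    print(name, stats)
```

With a real model behind `generate`, consistent gaps between names in length or positive-word rate are the signal to investigate further, ideally with a proper sentiment classifier and a significance test.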

Open-ended generation and toxicity measurement

For chatbots and open-ended applications, researchers measure whether the model generates more toxic, negative, or stereotyped text when prompted with terms associated with certain groups. The BOLD dataset (Bias in Open-ended Language Generation) provides 23,679 English prompts across five domains — profession, gender, race, religion, and political ideology — and pairs each prompt with metrics for toxicity, sentiment, and gender polarity. Tools like Google's Perspective API score generated text on dimensions including toxicity, identity attack, insult, and profanity, returning a score from 0 to 1.

The key benchmarks and what they test

No single benchmark covers all bias types. The standard practice is to run several that complement each other. Here are the most commonly cited ones:

Benchmark	Format	What it tests	Bias categories
BBQ	Multiple-choice QA	Whether models give stereotype-consistent answers when context is ambiguous, and correct answers when it is disambiguated	Age, disability, gender, nationality, race, religion, sexual orientation, socioeconomic status, physical appearance
StereoSet	Fill-in-the-blank	Model's preference for stereotype vs. anti-stereotype vs. irrelevant completions	Gender, profession, race, religion
CrowS-Pairs	Paired sentence scoring	Whether the model assigns higher probability to stereotyped than to counter-stereotyped sentences	Race, gender, religion, age, disability, nationality, sexual orientation, appearance, socioeconomic status
WinoBias	Coreference resolution	Whether pronoun resolution follows gender stereotypes for occupations	Gender (occupation stereotypes)
BOLD	Open-ended generation	Sentiment, toxicity, and gender polarity in free-text completions about different groups	Profession, gender, race, religion, politics
RealToxicityPrompts	Open-ended generation	Rate at which models produce toxic outputs when prompted with naturalistic sentence beginnings	Toxicity, identity attacks, profanity

BBQ in detail

The Bias Benchmark for QA (BBQ) is one of the most widely cited benchmarks for measuring social biases in LLMs via question answering. It contains 58,492 unique examples built from templates across nine socially sensitive categories. Each question has three answer choices: one that reflects a stereotype (the biased answer), one that challenges the stereotype, and "Unknown", which is the correct answer when context is ambiguous. BBQ is designed to catch two distinct failure modes: (1) the model picks the stereotyped answer when context is ambiguous, and (2) the model ignores explicit disambiguating context and still picks the stereotyped answer.
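
The two failure modes can be scored separately once each model answer has been labeled. The sketch below assumes a hypothetical record format (it is not the dataset's own schema) and computes accuracy plus a bias score modeled on the one in the BBQ paper: the share of non-"Unknown" answers that are stereotyped, rescaled to the range -1 to 1. The published metric applies further scaling in ambiguous contexts, which this sketch leaves out, so check the paper before reporting numbers.

```python
def score_bbq_style(records):
    """Score records shaped like
    {"context": "ambiguous" | "disambiguated",
     "answer": "stereotyped" | "anti_stereotyped" | "unknown",
     "correct": one of the same three labels}."""
    scores = {}
    for context in ("ambiguous", "disambiguated"):
        subset = [r for r in records if r["context"] == context]
        if not subset:
            continue
        accuracy = sum(r["answer"] == r["correct"] for r in subset) / len(subset)
        committed = [r for r in subset if r["answer"] != "unknown"]
        if committed:
            share = sum(r["answer"] == "stereotyped" for r in committed) / len(committed)
            bias = 2 * share - 1  # -1: always anti-stereotyped, +1: always stereotyped
        else:
            bias = 0.0  # the model never committed to either group
        scores[context] = {"accuracy": accuracy, "bias": bias}
    return scores

# Invented answers for illustration.
records = [
    {"context": "ambiguous", "answer": "unknown", "correct": "unknown"},
    {"context": "ambiguous", "answer": "stereotyped", "correct": "unknown"},
    {"context": "ambiguous", "answer": "stereotyped", "correct": "unknown"},
    {"context": "ambiguous", "answer": "anti_stereotyped", "correct": "unknown"},
    {"context": "disambiguated", "answer": "anti_stereotyped", "correct": "anti_stereotyped"},
    {"context": "disambiguated", "answer": "stereotyped", "correct": "anti_stereotyped"},
]
scores = score_bbq_style(records)
print(scores)
```

Low accuracy with a positive bias score in the ambiguous split corresponds to failure mode (1); a positive bias score in the disambiguated split corresponds to failure mode (2).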

The limits of benchmarks

Published benchmarks have real weaknesses. They are largely US-centric and English-language. Models are increasingly trained on data that includes the benchmarks themselves, so high scores may reflect memorization rather than genuine fairness. And as researchers build benchmarks for specific cultural contexts — KoBBQ for Korean, BasqBBQ for Basque, Dutch CrowS-Pairs — it becomes clear that bias patterns differ significantly across languages and cultures. A model that performs well on US-focused benchmarks may exhibit strong biases when deployed in a different cultural context.

Fairness metrics: what the numbers actually mean

Beyond benchmarks, builders working on classification or ranking tasks need quantitative fairness metrics to measure whether their model's predictions are equally good across groups. The three most important metrics — and why you cannot maximize all three simultaneously — are explained below.

Three core fairness metrics

Demographic Parity

What: equal positive-prediction rate across groups
Formula: P(Ŷ=1 | A=0) = P(Ŷ=1 | A=1), where Ŷ is the model's prediction and A is the group attribute
Use when: no ground-truth labels available
Limitation: ignores whether predictions are correct

Equal Opportunity

What: equal true-positive rate across groups
Formula: TPR(A=0) = TPR(A=1), where TPR(A=a) = P(Ŷ=1 | Y=1, A=a), Ŷ is the prediction, and Y is the true label
Use when: the cost of false negatives is the key concern
Limitation: requires ground-truth labels

Calibration

What: predicted probabilities match actual outcomes equally across groups
Formula: P(Y=1 | score=s, A=0) = P(Y=1 | score=s, A=1)
Use when: the model outputs a score used for decisions
Limitation: mathematically incompatible with demographic parity when base rates differ
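
As a rough sketch of how the second and third metrics can be checked on labeled data, the functions below compute the true-positive rate per group and the observed positive rate per score bin per group. The row format and the toy rows are assumptions made for illustration; real audits need far more examples per bin than this.

```python
from collections import defaultdict

def true_positive_rate_by_group(rows):
    """rows: dicts with "group", "label" (0/1 ground truth), and "pred" (0/1)."""
    hits, positives = defaultdict(int), defaultdict(int)
    for r in rows:
        if r["label"] == 1:
            positives[r["group"]] += 1
            hits[r["group"]] += r["pred"]
    return {g: hits[g] / positives[g] for g in positives}

def calibration_by_group(rows, n_bins=5):
    """Observed positive rate per (group, score bin); rows need "score" in [0, 1]."""
    totals, positives = defaultdict(int), defaultdict(int)
    for r in rows:
        bin_index = min(int(r["score"] * n_bins), n_bins - 1)
        key = (r["group"], bin_index)
        totals[key] += 1
        positives[key] += r["label"]
    return {key: positives[key] / totals[key] for key in totals}

# Invented rows for illustration.
rows = [
    {"group": "A", "label": 1, "pred": 1, "score": 0.9},
    {"group": "A", "label": 1, "pred": 0, "score": 0.4},
    {"group": "A", "label": 0, "pred": 0, "score": 0.1},
    {"group": "B", "label": 1, "pred": 1, "score": 0.9},
    {"group": "B", "label": 1, "pred": 1, "score": 0.7},
    {"group": "B", "label": 0, "pred": 1, "score": 0.9},
]
tpr = true_positive_rate_by_group(rows)
calibration = calibration_by_group(rows)
print(tpr)
print(calibration)
```

Equal opportunity asks that the values in `tpr` be close across groups; calibration asks that, within the same score bin, the observed positive rate be close across groups.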

For LLMs used in open-ended generation rather than classification, these metrics do not apply directly. Instead, builders measure proxy metrics: sentiment disparity (does the model produce more negative text about one group?), toxicity disparity (does the model generate more toxic content when one group is mentioned?), and representation disparity (does the model associate one group with a narrower range of roles or attributes?).
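
A proxy metric such as toxicity disparity reduces to comparing average scores per group. The sketch below assumes you already have a score between 0 and 1 for each generated output from a scorer of your choice (for example Perspective API or a sentiment classifier); the numbers are invented.

```python
from statistics import mean

def disparity(scores_by_group):
    """Return per-group mean score and the largest gap between any two groups."""
    means = {group: mean(scores) for group, scores in scores_by_group.items()}
    gap = max(means.values()) - min(means.values())
    return means, gap

# Invented toxicity scores for outputs generated from prompts mentioning each group.
toxicity_scores = {
    "group_a": [0.02, 0.05, 0.03, 0.10],
    "group_b": [0.04, 0.20, 0.08, 0.12],
}
means, gap = disparity(toxicity_scores)
print(means, f"gap = {gap:.3f}")
```

The same function works for sentiment disparity if the inputs are sentiment scores instead of toxicity scores.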

Running a basic bias audit on your app

You do not need a research team to do a meaningful bias audit. A structured process with standard tools catches the most common problems before launch.

Identify the sensitive attributes relevant to your use case. For a hiring tool: gender, race, age. For a medical Q&A: race, socioeconomic status, language background. For a content moderation system: religion, political affiliation. Write these down before you start — they determine what you test.
Build counterfactual test cases. Take 20-50 of your real user prompts and create variants that swap one attribute at a time. Run both versions through the model and compare outputs for tone, length, completeness, and sentiment.
Run relevant benchmarks. If your task is QA, run BBQ on the categories relevant to your use case. If your app generates open-ended text, run BOLD-style prompts and score outputs with Perspective API or a sentiment classifier.
Measure subgroup performance. If you have labeled test data, compute accuracy (or your task metric) separately for each demographic group. There is no universal cutoff, but a gap of around 5 percentage points between groups is a reasonable trigger for closer investigation.
Log results and define gates. Document your findings. Set thresholds: "We will not ship if any subgroup accuracy is more than 8 points below the overall average" or "toxicity disparity across groups must stay below 0.05." Make the thresholds explicit before you measure.
Re-audit after every major prompt or model change. Bias characteristics can shift when you change the system prompt, switch model versions, or add retrieval-augmented content. Treat the audit as a recurring check, not a one-time box.

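# Example: simple demographic parity check for a classifier app
# Assumes `test_set` is a list of dicts with "text" and "demographic_group" keys,
# and `llm_predict(text)` wraps your model and returns 0 or 1.
from collections import defaultdict

results = defaultdict(list)  # {group: [predicted_positive (0/1)]}

for item in test_set:
    prediction = llm_predict(item["text"])  # 0 or 1
    results[item["demographic_group"]].append(prediction)

# Positive-prediction rate per group, with the sample size for context
rates = {group: sum(preds) / len(preds) for group, preds in results.items()}
for group, rate in rates.items():
    print(f"{group}: positive rate = {rate:.3f} (n={len(results[group])})")

# Demographic parity holds if all rates are similar.
# A gap larger than roughly 0.05 to 0.10 is worth investigating.
gap = max(rates.values()) - min(rates.values())
print(f"largest gap between groups = {gap:.3f}")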
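
The "log results and define gates" step can be made mechanical. Below is a minimal sketch of the example gate from the list above ("more than 8 points below the overall average"); the function name, group names, and accuracy figures are all illustrative.

```python
def check_gates(subgroup_accuracy, overall_accuracy, max_accuracy_drop=0.08):
    """Return the subgroups whose accuracy falls too far below the overall figure."""
    failures = {}
    for group, accuracy in subgroup_accuracy.items():
        drop = overall_accuracy - accuracy
        if drop > max_accuracy_drop:
            failures[group] = round(drop, 3)
    return failures

# Illustrative numbers only.
subgroup_accuracy = {"group_a": 0.91, "group_b": 0.88, "group_c": 0.79}
failures = check_gates(subgroup_accuracy, overall_accuracy=0.89)

if failures:
    print("Do not ship; gates failed for:", failures)
else:
    print("All subgroup gates passed.")
```

Running a check like this in CI alongside your other evaluations keeps the threshold fixed before anyone sees the results.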

Going deeper

The field of LLM bias measurement is moving fast, and several important tensions remain unresolved. The first is the benchmark contamination problem: as models are trained on ever-larger web corpora, they increasingly memorize benchmark examples. A model that has "seen" the BBQ templates during training can achieve a high score by retrieval rather than by genuine fairness. Researchers are responding by building dynamic benchmarks that use templates the model cannot have memorized, but this is an ongoing arms race.

The second tension is between individual fairness (similar individuals should be treated similarly) and group fairness (statistical outcomes should be equal across groups). These can conflict: a model calibrated to produce equal outcomes across demographic groups may produce inconsistent outcomes for specific individuals within those groups. Most published benchmarks measure group fairness; individual fairness is harder to operationalize but arguably more relevant for high-stakes decisions.

Intersectionality: where single-attribute tests miss the hardest cases

Single-attribute counterfactual tests (swap the name, measure the change) only catch one dimension at a time. In practice, bias is often intersectional: a model might treat women fairly in aggregate, and treat people of color fairly in aggregate, but treat women of color unfairly in ways that neither single-attribute test would catch. Measuring intersectional bias requires test cases that vary multiple attributes simultaneously, and the size of the test matrix is the product of the number of values for each attribute, so it grows quickly. The CEB (Compositional Evaluation Benchmark for Fairness) benchmark, published in 2024, moves in this direction by organizing its evaluations along several dimensions at once (bias type, social group, and task) rather than one attribute at a time.
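
A small sketch shows how quickly the matrix grows when attributes are varied together. The attribute values and the template are hypothetical examples; substitute the attributes you identified for your own use case.

```python
from itertools import product

# Hypothetical attribute values for illustration.
attributes = {
    "gender": ["woman", "man", "nonbinary person"],
    "age": ["25-year-old", "60-year-old"],
    "name": ["Emily", "Mohammed", "Wei"],
}
template = (
    "Write a performance review for {name}, a {age} {gender} "
    "who works as a software engineer."
)

def build_prompts(template, attributes):
    """Build one prompt per combination of attribute values."""
    keys = list(attributes)
    prompts = []
    for values in product(*(attributes[k] for k in keys)):
        prompts.append(template.format(**dict(zip(keys, values))))
    return prompts

prompts = build_prompts(template, attributes)
print(len(prompts))  # 3 * 2 * 3 = 18 combinations
```

Adding a fourth attribute with four values would bring the total to 72 prompts per base template, which is why intersectional suites usually sample combinations instead of enumerating all of them.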

The metric choice is a values choice

Perhaps the deepest lesson from fairness research is that there is no neutral choice of fairness metric. Demographic parity, equal opportunity, and calibration encode different answers to the question "what does fair treatment mean?" — and those answers have political and ethical dimensions. A model that satisfies calibration is saying "given the same predicted probability, outcomes should be equal" — which accepts existing base-rate differences as input. A model that enforces demographic parity is saying "the output distribution should be equal regardless of base rates" — which intervenes to produce equal outcomes. Neither is objectively correct. The right choice depends on the harm you are trying to prevent and the values you bring to the problem. Documenting which fairness definition you chose and why is part of a rigorous audit.
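
A toy calculation makes the tension concrete. The labels below are invented: group A has a base rate of 0.5 and group B a base rate of 0.2. A classifier that is perfectly accurate violates demographic parity, and forcing parity costs accuracy.

```python
def positive_rate(values):
    """Fraction of 1s in a list of 0/1 values."""
    return sum(values) / len(values)

# Invented ground truth: 5 of 10 positives in group A, 2 of 10 in group B.
labels_a = [1] * 5 + [0] * 5
labels_b = [1] * 2 + [0] * 8

# A perfectly accurate classifier predicts exactly the labels.
preds_a, preds_b = list(labels_a), list(labels_b)
parity_gap = positive_rate(preds_a) - positive_rate(preds_b)

# Raising group B to group A's positive rate means flipping three true negatives.
preds_b_parity = [1] * 5 + [0] * 5
errors_b = sum(p != y for p, y in zip(preds_b_parity, labels_b))

print(f"parity gap with perfect predictions: {parity_gap:.1f}")
print(f"errors introduced in group B to close it: {errors_b}")
```

Whether those added errors are an acceptable price depends on the harm being prevented, which is the values question this section describes.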

Finally, bias measurement is only half the picture — the other half is mitigation. The main mitigation strategies are pre-processing (balancing the training data), in-processing (adding fairness constraints to the loss function during training), and post-processing (adjusting model outputs or decision thresholds at inference time). Each strategy has trade-offs against task performance: reducing bias in outputs often comes with some cost to accuracy on the majority group, and the right trade-off depends on the application context. Understanding the measurement tools described in this article is the prerequisite for having that conversation clearly.

FAQ

What is the BBQ dataset and why do researchers use it?

BBQ (Bias Benchmark for QA) is a dataset of 58,492 multiple-choice questions designed to probe whether LLMs answer using stereotypes when context is ambiguous. Each question has a stereotyped answer, an anti-stereotyped answer, and "Unknown" — the correct choice when not enough information is given. Researchers use it because it covers nine socially sensitive categories and separates two distinct bias failure modes: bias under ambiguity and bias despite explicit clarifying context.

What is counterfactual fairness testing and how do I run it?

Counterfactual fairness testing swaps one demographic attribute in a prompt while keeping everything else identical, then checks whether the model's output changes. To run it: take a real prompt from your app, create two or more versions that differ only in a name, pronoun, or demographic reference, send each through the model, and compare the outputs for differences in tone, length, or content. Systematic differences point to a bias tied to that attribute.

Can a model pass bias benchmarks and still be biased in my app?

Yes. Standard benchmarks are mostly US-centric and English-language, and they only cover the bias categories they were designed to test. A model can score well on BBQ while still exhibiting bias on a task the benchmark does not cover, or when deployed in a non-US cultural context. Benchmarks are a floor, not a ceiling — a good score rules out only the specific patterns the benchmark tests for.

What is demographic parity and when should I use it?

Demographic parity requires that the model produces positive predictions at equal rates across demographic groups. It is the right metric when you do not have ground-truth labels and want to check for disparate impact in raw output rates. Its limitation is that it does not verify whether the predictions are correct — it only verifies that they are equally distributed. If base rates genuinely differ across groups, enforcing demographic parity means making some predictions worse.

Does fine-tuning my own model reduce or increase bias?

It depends entirely on your fine-tuning data. Fine-tuning on a carefully balanced, representative dataset can reduce biases from the base model. Fine-tuning on data that over-represents one group or includes stereotyped examples can amplify biases significantly. Always audit your fine-tuning dataset before training and re-run your bias evaluation suite after fine-tuning, not just before.

What is the difference between bias and toxicity in LLMs?

Toxicity is content that is harmful, offensive, or abusive — slurs, threats, graphic violence. Bias is a systematic difference in how the model treats groups that may not be explicitly harmful in any single output but produces unfair patterns at scale. A model can be biased without being toxic (e.g., consistently giving vaguer medical advice to certain groups) and can be toxic without being specifically biased (e.g., generating offensive content indiscriminately).
