Why Do Models Refuse? Refusals and Over-Refusals Explained

Q: What benchmarks measure over-refusal in LLMs?

The main ones are **OR-Bench** (80,000 prompts, ICML 2025), **XSTest** (250 manually written prompts that mimic unsafe phrasing without being unsafe), and **FalseReject** (large-scale pseudo-toxic prompts with reasoning annotations). OR-Bench is the most comprehensive and includes an automated pipeline for continuous updates, making it less susceptible to models overfitting the static test set.

Understand why models refuse, where over-refusal comes from, and what builders can do about false positives.

BEGINNER12 MIN READUPDATED 2026-06-12

In plain English

A refusal is what happens when a language model declines to answer a request — sometimes with a polite hedge, sometimes with a flat "I can't help with that." Refusals exist because every frontier model ships with safety training baked in: the lab has trained it to avoid generating content that could be harmful, illegal, or against their policies. That is mostly a good thing. The problem is that safety training is imprecise, and models sometimes refuse requests that are completely benign. That failure mode is called over-refusal — and it is the tension every AI product builder eventually hits.

Models Refuse — diagram — Models Refuse — redhat.com

Think of a bouncer at a venue with a strict door policy. The policy says: "no weapons." The bouncer, applying that rule with limited context, turns away a chef who is carrying a set of kitchen knives to a catering gig upstairs. The knives are real. The danger is not. The bouncer applied a surface-level pattern match — knives → weapons → deny — without understanding the full context. LLMs do something structurally similar: they learned patterns that correlate with harm during training, and they fire those patterns even when the context would make the request clearly safe for a thoughtful reader.

Why it matters

For end users the immediate cost of over-refusal is frustration: the model refuses to help write a thriller novel chapter, explain how a historical atrocity happened, or look up medication dosage tables — tasks that any competent librarian would assist with without hesitation. The model treats intent as guilty until proven innocent, and the user has no obvious way to prove their intent.

For builders, over-refusal is a reliability and trust problem. If your medical information product refuses to explain drug interactions, or your legal-research tool refuses to summarize case law about violent crimes, the product fails its users at exactly the moments it matters most. Over-refusal is not a safe default — it has real costs. Research on Claude-3 found that despite Anthropic's stated goal of reducing over-refusal, the model had measurably higher over-refusal rates than other frontier models on large-scale benchmarks of safe-but-surface-level-suspicious prompts.

The safety-helpfulness tradeoff

Labs face a genuine dilemma: optimize hard for safety and your model starts acting like a liability-averse lawyer who adds disclaimers to everything. Optimize hard for helpfulness and you ship a model that assists with genuinely harmful requests. There is no setting that eliminates both failure modes simultaneously. The goal of modern safety research is to push the Pareto frontier — to build models that refuse less on safe requests while refusing more reliably on harmful ones — but no model has solved it completely.

How it works

Refusal behavior is not a simple keyword filter bolted onto the output. It is the result of the training pipeline itself — the model has learned, across billions of examples and feedback signals, to associate certain kinds of requests with a refusal response. Understanding how that happens requires understanding the three layers where safety behavior is instilled.

Layer 1: Supervised fine-tuning on curated examples

After pretraining, a base model is fine-tuned on human-curated conversations that demonstrate the desired behavior. This dataset includes many examples of the model appropriately refusing harmful requests — and those examples teach the model the form of refusals: the phrases, the hedges, the gentle redirects. This is where the stylistic fingerprint of refusals gets baked in.

Layer 2: RLHF — the reward model learns to prefer safe outputs

Reinforcement Learning from Human Feedback (RLHF) trains a reward model on human preference rankings, then uses that reward model to fine-tune the LLM further via a reinforcement-learning step (typically PPO). Human raters reliably penalize outputs that seem harmful or irresponsible — so the reward model learns that refusals on ambiguous prompts score well. The LLM then optimizes toward that reward, which can mean refusing more broadly than the raters actually intended.

Layer 3: Constitutional AI and synthetic preference data

Anthropic's Constitutional AI method replaces some human annotation with a written "constitution" — a list of principles the model critiques its own outputs against. A helper model generates a response, a critic model checks it against the constitution, and the revised output forms a synthetic preference signal. This is sometimes called RLAIF (RL from AI Feedback). Constitutional AI is designed to encourage reasoned refusals over blanket refusals, but the quality of the balance depends heavily on what the constitution says and how the critique loop is calibrated.

// How refusal behavior enters a model

PretrainingBase model learns language from the internetSupervised fine-tuningCurated examples show how to refuse harmful requestsReward model trainingHuman raters penalize unsafe or irresponsible outputsRL fine-tuning (RLHF / RLAIF)Model updates weights to score well on the reward modelDeployed modelRefusal behavior is now baked into the weights

The refusal direction: a single linear feature

Research published at NeurIPS 2024 by Arditi et al. found something surprising about how refusals are stored mechanistically: refusal behavior is mediated by a single direction in the model's residual stream — a one-dimensional subspace in activation space — across 13 tested open-source chat models up to 72B parameters. Erasing that direction (directional ablation) removed refusal capability almost entirely; injecting it into a harmful prompt caused the model to refuse even benign completions. This linear feature generalizes across languages. The practical implication: refusal is not a complex multi-step reasoning process in most models — it is a relatively simple feature that fires when the prompt's representation is sufficiently "near" the refusal direction in activation space. Over-refusal happens when benign prompts project onto that direction because they share surface vocabulary or topic with harmful ones.

What causes over-refusal

Over-refusals tend to cluster around a few predictable patterns. Knowing them helps you design prompts and systems that avoid triggering them unnecessarily.

Pattern	Example benign request	Why it triggers refusal
Surface keyword match	"How do poisons work in Agatha Christie novels?"	The word "poison" activates the refusal direction regardless of the fictional/literary context.
Topic association	"Explain the chemistry behind fireworks"	Explosives-adjacent vocabulary correlates with harmful requests in training data.
Dual-use information	"What are common household chemicals you shouldn't mix?"	Safety information about dangerous combinations is indistinguishable to the model from instructions to create them.
Professional context not stated	"What is a lethal dose of acetaminophen?"	A toxicologist, nurse, or concerned parent all send identical text; the model assumes worst-case intent.
Historical or journalistic framing	"Describe the methods used by the Nazis in detail"	Accurate historical description shares vocabulary with content that glorifies atrocities.
Persona and fiction prompts	"Write a villain who explains their plan"	Fictional wrapper does not reliably reduce the model's perceived risk of the underlying content.

The underlying cause in all cases is context insensitivity: the model is doing a coarse similarity match between the prompt and the patterns it associates with harm, rather than reasoning about the full situational context. A human expert reading any of the requests above would immediately recognize the benign interpretation. The model weights the surface signal heavily because that is what the reward model was trained on.

How builders reduce false positives

If you are shipping a product on top of a foundation model and over-refusal is hurting your users, you have several levers — some are prompt-level, some require model-level changes, and some require infrastructure.

1. System prompt context (operator instructions)

Most frontier model APIs distinguish between the operator (you, the developer) and the user (your end user). Content delivered in the system prompt carries more trust than content in the user turn. If your product is a medical reference tool, saying so explicitly — and describing who the users are — gives the model legitimate context to apply a different safety calibration. A system prompt like "This assistant is used by licensed healthcare professionals for clinical reference. Users may ask about dosages, drug interactions, and overdose thresholds." can substantially shift the model's willingness to engage with dual-use medical information.

Example system prompt for a medical reference tooltext

You are a clinical reference assistant used by licensed physicians and pharmacists.
Users are credentialed healthcare professionals with a legitimate need for precise
pharmacological information, including dosing ranges, overdose thresholds, and
drug interaction data. Provide accurate, detailed information without paternalistic
disclaimers. Refer users to poison control only when they explicitly describe an
emergency situation.

2. Prompt framing and specificity

Vague prompts that brush past dangerous-sounding keywords without context are most likely to be refused. Adding explicit purpose, audience, or framing — "for a graduate history course", "to understand the safety precautions to take", "for a published novel where the character..." — shifts the prompt's projection in the model's activation space. This is not jailbreaking; it is giving the model the context it needs to make a better inference about intent.

3. Measuring refusals with evals before shipping

A dedicated over-refusal eval suite — a set of prompts representing the realistic requests your users will send, labeled "should be answered" — catches over-refusal before it reaches production. You can track your false-positive rate across model versions and system prompt changes. Without this, you are flying blind: internal safety testing only catches false negatives, never false positives.

4. Model selection and fine-tuning

Different models have different over-refusal profiles, and those differences are measurable on benchmarks like OR-Bench and XSTest. If the base model you are using refuses too aggressively for your use case, you can evaluate alternatives, or fine-tune on domain-specific data that teaches the model what "appropriate engagement" looks like in your context. Preference optimization techniques (like DPO) applied on your own labeled dataset of over-refusal cases can directly shift the balance.

Going deeper

For practitioners who want to go beyond prompt engineering, several advanced techniques exist for directly controlling refusal behavior at the model level.

Activation steering

Because refusal is mediated by a single direction in activation space, it is possible to steer that direction at inference time — injecting or subtracting from it without changing the model weights. Research into "configurable refusal" constructs category-specific steering vectors that can dial refusal up or down for particular topic domains independently. This is an emerging technique mostly used in research contexts, but it foreshadows a future where operators can configure refusal sensitivity programmatically per request category rather than relying entirely on prompt engineering.

Inference-time activation energy methods

A complementary approach detects over-refusal at inference time by analyzing the energy of the model's internal activations at the point where a refusal is about to be generated. If the activations suggest the model is in a "false alarm" refusal state rather than a genuine harm state, the generation can be steered back toward helpfulness. This avoids the blunt-instrument problem of activation steering (which can remove too much safety behavior) by only intervening when a false positive is likely.

Constitutional and preference-based re-calibration

Labs can update the constitution or the preference dataset to explicitly penalize over-refusals. The SafeConstellations method (2025) introduced task-specific safety trajectories — the model learns a different refusal policy depending on the task context, rather than applying a global policy uniformly. This is the direction the field is moving: contextual refusal policies rather than topic-level ones.

The dual newspaper test

Anthropic has described a useful mental model for calibrating refusals: ask whether a response would be criticized by a journalist writing about AI harms (bad output) — but also whether it would be criticized by a journalist writing about AI paternalism and over-refusal (unnecessary refusal). A well-calibrated model should pass both tests. Applied to your own product evals, this framing helps teams avoid the common trap of treating every refusal as inherently safe.

// Correct refusal vs. over-refusal

Correct refusal

Request is genuinely harmful
No plausible benign interpretation
Refusal reduces expected harm
User cannot reframe legitimately
Consistent across all phrasings

Over-refusal

Request is benign in context
Surface vocabulary triggers safety pattern
Refusal hurts users with legitimate needs
Reframing with context resolves the refusal
Sensitive to wording, not to actual intent

Refusal behavior will continue to evolve as labs develop more sophisticated alignment techniques. The current era — where refusal is a coarse linear feature fired by surface-level pattern matching — is likely a transitional state. Future models with stronger reasoning capabilities and more granular safety policies may be able to weigh the full situational context before deciding, rather than relying on shallow heuristics. Until then, understanding the mechanics described here gives you the tools to work around the limitations at the product and prompt level.

FAQ

Why does ChatGPT refuse my request when it seems completely harmless?

The model is pattern-matching your request against topics it learned to associate with harm during training — not reasoning from first principles about your intent. If your request shares vocabulary or subject matter with genuinely harmful prompts, the model's safety features fire even when your actual intent is benign. Adding context about your purpose, profession, or use case often resolves this.

What is the difference between a refusal and a content filter?

A content filter is an external classifier that runs separately from the model and blocks inputs or outputs that match a list of banned patterns. A refusal is the model itself declining to answer — it is trained behavior, not an external gate. Most production AI products use both: the model can refuse from its training, and additional filters may block content the model missed. The two can conflict: a model might be willing to help, but an upstream filter blocks the request before the model ever sees it.

Can I reduce over-refusals by telling the model to ignore its safety training?

No — and doing so is likely against the terms of service of every major API provider. More importantly, it is the wrong framing. The goal is not to disable safety behavior but to give the model accurate context so it applies safety behavior appropriately. A system prompt that describes your legitimate use case and user base is the sanctioned, effective way to shift refusal calibration for your application.

Do different models have different over-refusal rates?

Yes, meaningfully so. Benchmarks like OR-Bench (80,000 safe-but-surface-level-suspicious prompts) show measurable differences across frontier models. Claude-3 models showed higher over-refusal rates than other frontier models on these benchmarks despite Anthropic's stated goal of reducing them. Over-refusal rates are also not static — they change with each model version as labs tune the safety-helpfulness balance.

Is over-refusal the same as the model being more ethical?

No. Over-refusal is a calibration error, not a sign of stronger ethics. Refusing to explain how historical atrocities happened, declining to describe medication interactions for a nurse, or blocking a novelist from writing a villain does not protect anyone — it just fails legitimate users. A well-calibrated model refuses harmful requests reliably, not requests that superficially resemble harmful ones.

What benchmarks measure over-refusal in LLMs?

The main ones are OR-Bench (80,000 prompts, ICML 2025), XSTest (250 manually written prompts that mimic unsafe phrasing without being unsafe), and FalseReject (large-scale pseudo-toxic prompts with reasoning annotations). OR-Bench is the most comprehensive and includes an automated pipeline for continuous updates, making it less susceptible to models overfitting the static test set.

// In plain English

// Why it matters

The safety-helpfulness tradeoff

// How it works

Layer 1: Supervised fine-tuning on curated examples

Layer 2: RLHF — the reward model learns to prefer safe outputs

Layer 3: Constitutional AI and synthetic preference data

The refusal direction: a single linear feature

// What causes over-refusal

// How builders reduce false positives

1. System prompt context (operator instructions)

2. Prompt framing and specificity

3. Measuring refusals with evals before shipping

4. Model selection and fine-tuning

// Going deeper

Activation steering

Inference-time activation energy methods

Constitutional and preference-based re-calibration

The dual newspaper test

// FAQ

// Further reading

// Related

In plain English

Why it matters

How it works

What causes over-refusal

How builders reduce false positives

Going deeper

FAQ

Further reading

Related