AI/TLDR

What Is a Jailbreak? How People Trick LLMs into Breaking Rules

Understand what a jailbreak is, how one differs from an ordinary bug, and why no model is fully jailbreak-proof.

BEGINNER12 MIN READUPDATED 2026-06-12

In plain English

A jailbreak is a carefully crafted prompt — or a sequence of prompts — that convinces a large language model to ignore the safety rules it was trained to follow. The model was never supposed to explain how to synthesize a dangerous substance, impersonate a real person, or produce slurs on demand. A jailbreak is what gets it to do those things anyway.

Think of a theme park with a height requirement. The rule says "you must be this tall to ride." A jailbreak is the trick where someone stands on their toes, wears a tall hat, or convinces the operator the rule does not apply to them today — and gets on the ride. The ride (the model) still exists, the rules still exist, but the adversary found a gap between the rule as it was written and the rule as the system enforces it.

The key insight is that modern LLMs do not follow safety rules by checking a lookup table. They follow them because training shaped their behavior — a process called alignment, typically built on top of reinforcement learning from human feedback (RLHF) and related techniques. Alignment is powerful but imperfect. It teaches the model to refuse harmful requests in the situations the training team anticipated, but it cannot cover every possible phrasing, persona, or multi-step sequence an attacker might invent. Jailbreaks exploit the gaps.

Why it matters for AI builders

If you ship a product powered by an LLM, jailbreaks are your problem — even if you did not write one line of the underlying model. Providers like OpenAI, Anthropic, and Google invest heavily in safety training, but that training covers their baseline policies. Your application's specific rules — "never recommend a competitor," "never advise on dosages," "never reveal the system prompt" — are not in the model's training data. They live in your system prompt, and a jailbreak can talk the model into ignoring them.

The real cost of a successful jailbreak

  • Brand damage. A screenshot of your friendly chatbot saying something hateful or dangerous can go viral in minutes. The context almost never travels with the screenshot.
  • Legal and compliance risk. Medical, financial, and legal apps that produce advice the model was configured to withhold can expose you to real liability.
  • Operational damage. If your model controls tools — sending email, running code, moving money — a jailbreak that hijacks those tools is not embarrassing; it is an incident.
  • Loss of user trust. Users who see an AI product behave dangerously, even once, tend not to come back.

The broader reason jailbreaks matter is what they reveal about the state of AI safety. Every successful jailbreak is evidence that the gap between "the model was trained to be helpful and harmless" and "the model reliably behaves helpfully and harmlessly under adversarial conditions" is real and measurable. That gap is exactly what red teaming — the practice of attacking your own model deliberately — is designed to find and close.

How jailbreaks work

Every jailbreak shares the same goal — get the model to produce output it was trained to refuse — but they achieve it through very different mechanisms. The main families are social engineering, obfuscation, and context manipulation.

Social engineering — persona and roleplay

The oldest and most famous example is DAN (Do Anything Now), a prompt that asks ChatGPT to roleplay as a fictional AI that "has no rules." The attacker tries to convince the model that its safety training does not apply to the character it is playing. DAN-style attacks no longer work reliably on current frontier models, but the underlying pattern — wrap the forbidden request in fiction so the model treats refusal as breaking character — is still very much alive. Persona modulation, where the attacker asks the model to "act as a security expert," "be in developer mode," or "respond as if you were trained without restrictions," is a direct descendant.

Obfuscation — hiding the request

Safety training is stronger in some input representations than others. Keyword-level filters trained on English plaintext can often be bypassed by encoding the request differently: Base64, Caesar cipher, Morse code, Unicode homoglyphs (characters that look identical to standard letters but are different code points), strategic spacing within words, or simply asking in a low-resource language the model has less safety data for. The model still understands the encoded meaning — that is the whole point of the jailbreak — but the safety signal from training does not fire as reliably.

Context flooding — many-shot jailbreaking

As context windows expanded to hundreds of thousands of tokens, a new attack became viable: many-shot jailbreaking. The attacker fills a long context with dozens or hundreds of fabricated examples showing the model answering harmful questions without hesitation, then appends the real target question. The model, following the in-context pattern, replicates the behavior. Research published in 2024 showed that this technique scales reliably — more examples in the context leads to higher attack success rates, including on models that refuse the same request in a short context.

Multi-turn escalation — the Crescendo technique

The Crescendo attack, published in 2024 by Microsoft researchers, exploits the model's tendency to stay coherent within a conversation. The attacker starts with a completely benign message, receives a harmless reply, then takes a small step toward the target topic, leveraging the model's previous response as evidence that the direction is acceptable. Step by step, the model is walked toward content it would have refused in a single cold-start prompt. Crescendo typically reaches its goal in fewer than five turns and was demonstrated against multiple frontier models.

Adversarial suffixes — algorithmic attacks

The most technically sophisticated attacks use gradient-based optimization to find strings that, appended to a harmful request, reliably flip a model into compliance. The resulting suffixes are nonsensical to a human reader — gibberish tokens — but they hit the model in a structural weak spot that training has not hardened. Methods like GCG (Greedy Coordinate Gradient) sometimes find suffixes that transfer across different model families, meaning a suffix found against one model can work on a completely different one.

Jailbreak vs prompt injection: the critical difference

These two terms are often used interchangeably, but they describe different attacks at different layers. Knowing the difference matters when you are deciding where to build your defenses.

A jailbreak is about bypassing the model's trained values. The adversary types something into the chat interface and persuades the model to answer in a way it should refuse. The attack lives entirely in what the user says.

A prompt injection is about hijacking an application that uses an LLM as a processing engine. The adversary hides instructions inside content the model is asked to read — a webpage, a document, an email — and those instructions redirect the model's behavior. The user may not be malicious at all; the attack is in the data, not the chat input.

In agentic systems, the two attacks increasingly blur together: a prompt injection that causes an agent to abandon its task and follow adversarial instructions is also, effectively, a jailbreak of the application layer. Both need to be on your threat model.

Why no model is fully jailbreak-proof

A natural question after learning about jailbreaks is: why can't the model just be trained hard enough to make them impossible? The uncomfortable answer is that the research consensus, as of 2025-2026, is that perfect jailbreak resistance is likely unachievable with current techniques — not for lack of effort, but because of structural reasons.

The coverage problem

Safety training is built from examples: demonstrations of harmful requests paired with appropriate refusals. No training set can enumerate every possible way to phrase a request, encode it, or embed it in fiction. The input space of a language model is essentially infinite, and training can only explicitly cover the patterns the team anticipated. A creative attacker always has the advantage of novelty.

The helpfulness-safety tension

Every safety intervention trades off against usefulness. A model trained to refuse aggressively will start refusing legitimate requests — a doctor asking about drug interactions, a security researcher asking about malware, a novelist writing a villain. Providers and users have a genuine preference for useful models, which creates a ceiling on how restrictive safety training can realistically be. The attack surface is partly the gap that ceiling leaves open.

Alignment as a statistical phenomenon

LLM responses are probabilistic. Safety training makes harmful outputs unlikely, not impossible. A 2024 paper framed this formally, showing that for any reasonable definition of aligned behavior and any fixed training budget, a lower bound on the probability of successful jailbreaking can be derived — it cannot be driven all the way to zero. This is not a counsel of despair; it means the realistic goal is to make jailbreaks expensive, unreliable, and easily detected, not to achieve perfect immunity.

TechniqueWhat it hardensDoes not cover
RLHF / Constitutional AIAnticipated harmful prompt patternsNovel phrasings, encodings, long contexts
System-prompt instructionsApp-specific rules layered on topPrompts that override or ignore the system prompt
Output guardrails / classifiersCatches known bad outputs before deliveryNovel attacks that don't match known patterns
Input sanitizationFilters obvious injection attemptsSophisticated multi-turn or obfuscated attacks
Least-privilege tool designLimits blast radius if alignment failsDoes not prevent the jailbreak itself

The practical implication for builders is defense in depth: no single layer is enough, but multiple overlapping layers make a successful jailbreak that actually causes harm much harder to execute.

Going deeper

Once you understand the basics of jailbreaks, the more subtle and consequential questions open up: how do you measure your model's jailbreak resistance, how do you keep it from regressing as the model or app evolves, and what does cutting-edge research say about closing the gap permanently?

Measuring attack success rate

The key operational metric is attack success rate (ASR): the fraction of attack prompts from a standardized suite that produce a policy-violating output. A well-maintained red-team suite gives you a number you can track over time. When you patch a hole, the ASR drops. When a new model version ships, you rerun the suite before deploying. ASR is what turns "we think we're safer" into "we can prove we're safer" — or reveal that you're not.

Automated red teaming and LLM-as-attacker

Manually writing attack prompts does not scale. The modern approach uses a separate attacker LLM that is prompted to generate jailbreak variations against a target model, reads each refusal, and rewrites its attack to try to get around it. A judge LLM scores each response. This loop can run thousands of attacks per hour, covering the long tail of variations no human would think to write. Tools like Microsoft's PyRIT, NVIDIA's garak, and promptfoo's red-team mode package this pattern for teams that do not want to build it from scratch.

Emerging research: salting, activation steering, and proactive defense

Researchers are exploring defenses that go beyond prompt-level patching. LLM salting, explored by Sophos and presented at CAMLIS 2025, applies a lightweight targeted rotation to the model's "refusal direction" in activation space, disrupting the reuse of known jailbreak templates without hurting normal performance. Activation steering approaches try to detect when the model's internal representations are moving toward a "compliant with harmful request" state, before the harmful output is even generated. These are early-stage but point toward a future where safety is enforced at the representational level, not just the surface-text level.

The agentic frontier

As LLMs are wired into agents with real tools — web browsing, code execution, email, databases — the blast radius of a successful jailbreak grows dramatically. A jailbreak that makes a chatbot say something offensive is a PR problem. A jailbreak that tricks an agent into exfiltrating data, sending unauthorized emails, or executing destructive code is a security incident. The agentic setting also opens up the indirect prompt injection surface: the model can be jailbroken not through the user turn but through content it reads while doing its job. This is currently one of the most active areas in AI security research, with no fully satisfying solution yet.

Connecting back to alignment

Every jailbreak is a measurement of the gap between the model's stated values and its actual behavior under adversarial pressure. That gap is what the field of AI alignment is trying to close permanently — not just for safety reasons, but because a model whose behavior can be redirected by a clever user is, in a deep sense, not reliably aligned with anyone's intentions. Red teaming and jailbreak research are thus not a niche security concern: they are empirical feedback for the whole alignment project.

FAQ

What is a jailbreak in AI?

A jailbreak is a prompt (or a sequence of prompts) that manipulates a large language model into ignoring the safety rules it was trained to follow. Instead of refusing a harmful request, the model complies — because the jailbreak found a gap between the rule as it was trained and the way the model generalizes it to a new situation.

Does jailbreaking an LLM require hacking or coding skills?

No. The simplest jailbreaks are written in plain English and require no technical knowledge — things like asking the model to roleplay as a different AI or wrapping a forbidden request in fiction. The most sophisticated attacks (adversarial suffixes, many-shot flooding) do require technical expertise, but the everyday jailbreaks are accessible to anyone.

What is the DAN jailbreak and does it still work?

DAN ("Do Anything Now") was a roleplay prompt from 2022-2023 that asked ChatGPT to act as an AI with no restrictions. It became widely shared but was patched out of current frontier models relatively quickly. Naive DAN copy-paste no longer works reliably on models like GPT-4o or Claude 3-series, though persona-based attacks in general have not disappeared — they just require more creativity to succeed today.

What is the difference between a jailbreak and a prompt injection?

A jailbreak arrives through the user turn and persuades the model to ignore its trained values. A prompt injection arrives through data the model is asked to process (a document, a web page) and hijacks the model's actions at the application layer. Jailbreaks target the model; prompt injections target the app.

Can safety training make a model 100% jailbreak-proof?

No — current research shows this is not achievable with today's techniques. The input space is too vast to cover every possible attack phrasing, and there is a fundamental tension between refusing harmful requests and remaining useful for legitimate ones. The goal is to make jailbreaks expensive, unreliable, and unlikely to cause real harm through defense in depth, not to achieve perfect immunity.

Why should I care about jailbreaks if I'm using a big provider's API?

The provider's safety training covers their general policies — it does not know your app's specific rules. Any constraint you express only in a system prompt can potentially be bypassed by a jailbreak. You are responsible for testing your own deployment, adding output guardrails, and designing your tools with least-privilege principles so a successful jailbreak causes minimal damage.

Further reading