Simon Willison · 2026-06-22 · notable
Simon Willison: 'Prompt Injection as Role Confusion'
Simon Willison highlights a new paper by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell arguing prompt injection is really 'role confusion' — language models lean on style cues, not content, to tell trusted text from user input.
Simon Willison reframes prompt injection as a deeper role-perception bug rather than a parsing problem.
What is it?
Simon Willison's June 22, 2026 post unpacks a new paper, 'Prompt Injection as Role Confusion' by Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell. The piece argues that today's models cannot reliably tell privileged system text from untrusted user input because they key on formatting and tone, not on whose words they actually are.
How does it work?
The cited paper runs experiments where attackers shape user input to mimic the format and reasoning style of system messages — and models, including gpt-oss-20b, treat the imitation as trusted. Simon Willison walks through the role-confusion framing, then connects it back to why so many prompt-injection defenses that depend on delimiters or tags continue to fall.
Why does it matter?
Role confusion explains why patching individual injection patterns never seems to settle the problem — the underlying signal models use to distinguish trusted from untrusted text is unreliable. For anyone shipping LLM agents, Simon Willison's writeup is the fastest way to understand why robust defenses will need a real fix to role perception, not better string filters.
Try it
https://simonwillison.net/2026/Jun/22/prompt-injection-as-role-confusion/