OpenAI · 2026-06-16 · major
OpenAI Deployment Simulation — predict misbehavior before release
OpenAI Deployment Simulation replays real past user conversations through a candidate model before launch to forecast misbehavior rates. Tested across 1.3M conversations from GPT-5 Thinking to GPT-5.4 with a median 1.5x multiplicative error.

OpenAI estimates how a new model will behave in production by replaying real past conversations through it before release.
Key specs
| Conversations replayed | 1.3M |
|---|---|
| Median multiplicative error | 1.5x |
| Detection floor | 1 in 200,000 messages |
| Eval awareness vs swe bench | 5.1% vs 100% |
What is it?
Deployment Simulation is a pre-release safety evaluation method from OpenAI. Deployment Simulation takes recent production conversations, strips the old assistant reply, and lets the candidate model answer in the same real context, then grades the new answers for undesired behavior.
How does it work?
OpenAI applied Deployment Simulation to 1.3M de-identified conversations spanning GPT-5 Thinking through GPT-5.4 collected from August 2025 to March 2026. The team pre-registered predictions for 20 misbehavior types on GPT-5.4 Thinking and compared simulated rates to post-launch production rates, reporting a median multiplicative error of 1.5x. For agentic coding, simulated tool calls lifted discriminator accuracy from 11.6% to 49.5%.
Why does it matter?
Traditional evals like SWE-Bench look obviously like tests — GPT-5.2 labels them as evaluations 100% of the time, versus 5.1% for Deployment Simulation traffic — so models can game them. Replaying real conversations reduces that evaluation-awareness gap and lets safety teams catch behavior drift, reward hacking, or new misuse patterns before the model reaches users.
Who is it for?
AI safety teams, evaluation researchers, frontier-model auditors
Try it
Read the paper: https://cdn.openai.com/pdf/predicting-llm-safety-before-release-by-simulating-deployment.pdf