ServiceNow Research · 2026-06-18 · notable
MosaicLeaks — ServiceNow benchmark for research-agent privacy leaks
ServiceNow released MosaicLeaks, a 1,001-chain benchmark that measures how much private context a research agent leaks into its web queries, plus PA-DR, an RL recipe that drops leakage from 51.7% to 9.9% on Qwen3-4B while task success stays nearly unchanged (58.7% vs 59.3% for task-only RL).

Quick facts
| Field | Value |
|---|---|
| Maker | ServiceNow Research |
| Benchmark size | 1,001 multi-hop research chains (559 train / 98 val / 344 test) |
| Threat model | Private enterprise docs + controlled public web corpus |
| Base model | Qwen3-4B |
| Task-only RL leakage | 51.7% |
| PA-DR leakage | 9.9% |
| PA-DR task success | 58.7% vs 59.3% for task-only RL |
What is it?
MosaicLeaks is a benchmark from ServiceNow Research that measures privacy leakage by deep-research agents. Each of the 1,001 chains forces an agent to interleave searches over a private enterprise document set with searches against a controlled public web corpus, so leakage shows up directly in the queries the agent sends out.
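To make "leakage shows up directly in the queries" concrete, here is a minimal sketch of a per-chain leakage rate. Everything in it is an assumption for illustration: the `ChainRun` structure, the `private_terms` field, and the substring check are hypothetical stand-ins, and the benchmark's actual judging procedure and exact metric definition are not described here.

```python
from dataclasses import dataclass


@dataclass
class ChainRun:
    # Hypothetical record of one research chain (not the benchmark's schema).
    private_terms: list[str]  # sensitive strings drawn from the private documents
    web_queries: list[str]    # queries the agent sent to the public web corpus


def query_leaks(query: str, private_terms: list[str]) -> bool:
    """Naive stand-in judge: a query leaks if it contains a private term verbatim."""
    q = query.lower()
    return any(term.lower() in q for term in private_terms)


def leakage_rate(runs: list[ChainRun]) -> float:
    """Fraction of chains with at least one leaking web query (assumed definition)."""
    if not runs:
        return 0.0
    leaked = sum(
        any(query_leaks(q, run.private_terms) for q in run.web_queries)
        for run in runs
    )
    return leaked / len(runs)


runs = [
    ChainRun(["Acme Q3 churn"], ["saas churn benchmarks 2025"]),
    ChainRun(["Project Falcon"], ["who acquired Project Falcon vendor"]),
]
print(leakage_rate(runs))  # 0.5
```

A verbatim substring check misses paraphrased leaks, which is one reason a real evaluation would likely need a stronger judge than this.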
How does it work?
MosaicLeaks builds each chain as a multi-hop research task with bridging entities tying the private and public sides together. The split is 559 training, 98 validation, and 344 held-out-company test chains. The companion PA-DR (Privacy-Aware Deep Research) training recipe combines situational task rewards with a learned privacy reward. That reward judges two things: whether the current query leaks private content directly, and whether adding it to the existing query log creates a new 'mosaic' leak, meaning one that only emerges from the combination of queries. The paper reports 5-6x sample-efficiency gains from the situational rewards.
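The two privacy checks described above can be sketched as a per-query penalty subtracted from the task reward. This is a rule-based toy under stated assumptions, not the PA-DR recipe: the source describes a learned privacy reward, while this sketch swaps in token-set matching, and the names `privacy_penalty`, `step_reward`, and `weight` are hypothetical.

```python
def tokens(queries: list[str]) -> set[str]:
    """Lowercased whitespace tokens across a list of queries."""
    return {tok for q in queries for tok in q.lower().split()}


def privacy_penalty(query: str, log: list[str], facts: list[frozenset[str]]) -> float:
    """Toy penalty: +1 per private fact newly exposed by this query.

    A fact (a set of tokens) counts as exposed when all of its tokens appear.
    Direct leak: the query alone covers the fact.
    Mosaic leak: the earlier log did not cover it, but log + query does.
    """
    q_tok = tokens([query])
    log_tok = tokens(log)
    penalty = 0.0
    for fact in facts:
        if fact <= q_tok:
            penalty += 1.0  # direct leak
        elif not fact <= log_tok and fact <= (log_tok | q_tok):
            penalty += 1.0  # mosaic leak
    return penalty


def step_reward(task_reward: float, query: str, log: list[str],
                facts: list[frozenset[str]], weight: float = 1.0) -> float:
    """Assumed combination: task reward minus a weighted privacy penalty."""
    return task_reward - weight * privacy_penalty(query, log, facts)


facts = [frozenset({"acme", "layoffs", "march"})]
log = ["acme supplier news", "tech layoffs trends"]

print(privacy_penalty("march earnings calendar", log, facts))  # 1.0 (mosaic)
print(step_reward(0.5, "weather forecast", log, facts))        # 0.5 (no leak)
```

In the first call no single query names the full fact, yet the log plus the new query together cover it, which is the mosaic case the privacy reward is meant to catch.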
Why does it matter?
Research agents are starting to land inside enterprises, and 'the agent typed my customer list into Google' is the failure mode security teams fear. MosaicLeaks gives the field a shared scoreboard, and PA-DR shows the leakage rate is fixable — on Qwen3-4B, leakage drops from 51.7% to 9.9% while task success barely moves (58.7% vs 59.3%). Teams shipping deep-research agents can adopt PA-DR or use MosaicLeaks to red-team their own pipeline.
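For scale, the reported figures work out as follows. The snippet only does arithmetic on the four percentages quoted above; the variable names are illustrative.

```python
# Reported percentages on Qwen3-4B.
task_only_leak, pa_dr_leak = 51.7, 9.9
task_only_success, pa_dr_success = 59.3, 58.7

abs_drop = task_only_leak - pa_dr_leak             # leakage drop in percentage points
rel_drop = abs_drop / task_only_leak               # relative reduction in leakage
success_delta = task_only_success - pa_dr_success  # task-success cost in points

print(f"{abs_drop:.1f} {rel_drop:.1%} {success_delta:.1f}")  # 41.8 80.9% 0.6
```

So leakage falls by 41.8 percentage points (about an 81% relative reduction) at a cost of 0.6 points of task success.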
Who is it for?
Agent builders shipping to regulated enterprises, safety researchers, RLHF teams
Try it
https://huggingface.co/blog/ServiceNow/mosaicleaks