Shadow Mode: Test a New Model on Real Traffic

In plain English

Shadow mode is a deployment technique where a new model — the candidate — receives the exact same live requests as your production model, but its responses are never shown to users. The production model answers users as normal. The candidate runs silently in the background, its outputs logged and compared to the production model's. Users notice nothing. You learn everything.

The analogy that makes it click: think of a flight simulator running in parallel with a real cockpit. The trainee co-pilot (the candidate model) sees every instrument reading and makes every decision, but their hands are not on the controls. The instructor (your team) watches the trainee's decisions alongside the real pilot's and decides whether the trainee is ready to fly solo — without ever putting passengers at risk.

In LLM terms, the flight simulator is a mirrored request pipeline. Every inbound request is duplicated: one copy goes to your live model, which returns its response to the user; an identical copy is dispatched asynchronously to the candidate model. Both responses and their associated metadata (latency, token counts, error codes) are written to a log store. Analysis happens offline, after the fact, with no user-facing impact whatsoever.

Why it matters

Switching models is one of the highest-risk operations in a production LLM system. Eval datasets, no matter how carefully curated, never perfectly capture the full distribution of live traffic. A candidate that aces your benchmark may still behave unexpectedly on the long tail of edge cases that real users send every day.

Shadow mode solves this by letting you stress-test the candidate against the real input distribution before any user ever sees its output. The benefits cluster into four categories:

Zero user risk. The candidate's responses are invisible to users. Even if the candidate produces garbled, harmful, or wildly incorrect output, no user is affected.
Real distribution coverage. Live traffic includes rare prompts, unusual languages, adversarial inputs, and edge cases that synthetic eval sets routinely miss. Shadow mode measures the candidate on all of them, as they actually arrive, without exposing users to it.
Baseline before commitment. You establish an objective quality, latency, and cost baseline for the candidate before you have to make any go/no-go decision. That baseline is what makes the canary or full rollout that follows trustworthy.
Regression detection. If the candidate is worse than production on any metric — even a metric you didn't anticipate — shadow logs will surface it. You can catch regressions you didn't know to test for.

The economics matter too. Running a shadow for 24–48 hours on a sampled slice of traffic (say, 10–20% of requests) is far cheaper than a botched rollout that degrades user experience, triggers refund requests, or pages your on-call team at 2 a.m.

How it works

The core mechanics are straightforward: duplicate the request, dispatch asynchronously, log both responses, compare offline. The engineering details determine whether the shadow run gives you trustworthy signal.

// Shadow mode request flow

Inbound request: User sends a prompt to your API
Traffic mirror: Gateway duplicates the request (async copy to candidate)
Production model: Responds to user as normal; response logged
Candidate model: Receives identical input; response logged silently
Log store: Both responses, latency, token counts written together
Offline analysis: LLM judge or human reviewers compare paired responses

Where to insert the mirror

The mirror can live at several layers of your stack, each with tradeoffs:

Layer	How it works	Best for
API gateway / proxy	The gateway (e.g. an LLM gateway, Nginx, Envoy) clones the request before forwarding to production	Teams with a centralised gateway; minimal application code change
Application code	The app explicitly calls both models and discards the candidate's response before returning	Maximum control; works without infra changes
Service mesh (Istio, Envoy)	The mesh mirrors HTTP traffic at the sidecar level before it reaches the model endpoint	Kubernetes-native deployments; works transparently across services
Platform feature (SageMaker)	Managed shadow variant on the same endpoint; AWS handles the mirroring and logging	Teams already on SageMaker; lowest operational overhead

Sampling rate

You do not need to shadow 100% of traffic. Starting at 10–20% is sensible: it controls extra cost, limits load on the candidate endpoint, and still accumulates enough samples for statistical comparison within hours on a moderately trafficked service. Once you are confident the candidate is stable, you can raise the sampling rate to 100% to maximise coverage of edge cases before your canary.
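One way to pick the sampled slice is to hash a request identifier into a bucket, so the decision is deterministic and can be reproduced during analysis. The sketch below assumes each request already carries a string ID; the function name `should_shadow` is hypothetical.

```python
import hashlib


def should_shadow(request_id: str, sample_rate: float) -> bool:
    """Deterministically select a stable fraction of requests to shadow."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Raising `sample_rate` later only adds requests to the shadowed set: any ID selected at 10% is still selected at 100%.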

What to log

The full request payload (prompt, system message, parameters)
The production model's response text and finish reason
The candidate model's response text and finish reason
Latency for both: time-to-first-token (TTFT) and total duration
Token counts (prompt tokens, completion tokens) for cost estimation
Error codes and HTTP status for both
A shared request_id to join the two rows in analysis
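As a rough sketch, the fields above could be captured in one row per model call, with the two rows joined on `request_id` at analysis time. The class name, field names, and units below are illustrative assumptions, not a required schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class ShadowLogRow:
    request_id: str                 # shared by the production and candidate rows
    role: str                       # "production" or "candidate"
    request_payload: dict           # prompt, system message, parameters
    response_text: Optional[str]
    finish_reason: Optional[str]
    ttft_ms: Optional[float]        # time-to-first-token
    total_ms: float                 # total duration
    prompt_tokens: int
    completion_tokens: int
    http_status: int
    error_code: Optional[str] = None

    def to_json_line(self) -> str:
        """Serialise the row as one JSON line for an append-only log."""
        return json.dumps(asdict(self), ensure_ascii=False)


def pair_rows(rows: List[ShadowLogRow]) -> Dict[str, Dict[str, ShadowLogRow]]:
    """Join production and candidate rows on their shared request_id."""
    pairs: Dict[str, Dict[str, ShadowLogRow]] = {}
    for row in rows:
        pairs.setdefault(row.request_id, {})[row.role] = row
    # Keep only requests where both sides were logged.
    return {
        rid: pair for rid, pair in pairs.items()
        if "production" in pair and "candidate" in pair
    }
```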

Comparing shadow outputs

Collecting logs is the easy half. Turning logs into a go/no-go decision is the hard half. There are three complementary approaches, and production teams typically use all three together.

Structural metrics (automatic)

These require no human judgment and run in real time or near-real time:

Error rate — does the candidate refuse, time out, or return malformed JSON at a higher rate than production?
Latency percentiles — compare p50, p95, p99 TTFT and total duration. A faster candidate is great; a slower one that will hurt streaming UX is a blocker.
Token counts — completion length differences matter for cost. A candidate that consistently generates 40% longer responses may cost more despite a lower per-token price.
Format compliance — if your system prompt instructs the model to respond in JSON, does the candidate comply as reliably as production?
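These metrics can be computed directly from the logged rows. The sketch below assumes each row is a dict with `text`, `error`, `total_ms`, and `completion_tokens` keys (hypothetical names), uses a simple nearest-rank percentile, and treats "format compliance" as "parses as JSON".

```python
import json
import math
from typing import List, Optional


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def is_valid_json(text: Optional[str]) -> bool:
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False


def structural_metrics(rows: List[dict]) -> dict:
    """Summarise one model's rows; run once for production, once for candidate."""
    if not rows:
        raise ValueError("no rows to summarise")
    ok = [r for r in rows if r["error"] is None]
    latencies = [r["total_ms"] for r in ok]
    tokens = [r["completion_tokens"] for r in ok]
    return {
        "error_rate": 1 - len(ok) / len(rows),
        "p50_ms": percentile(latencies, 50) if ok else None,
        "p95_ms": percentile(latencies, 95) if ok else None,
        "p99_ms": percentile(latencies, 99) if ok else None,
        "mean_completion_tokens": sum(tokens) / len(tokens) if ok else None,
        "json_compliance": sum(is_valid_json(r["text"]) for r in ok) / len(ok) if ok else None,
    }
```

On the token-count point, the arithmetic is worth doing explicitly: a candidate priced 20% lower per output token that writes 40% more tokens costs 0.8 × 1.4 = 1.12 times as much for completions.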

LLM-as-judge (scalable quality signal)

For quality comparison, the most scalable approach is to send paired responses — (production output, candidate output) — to a judge model (often a capable, frontier-tier model like Claude or GPT) and ask it to rate which response is better on dimensions you care about: correctness, groundedness, helpfulness, tone, safety.

According to LangChain's 2025 State of AI Agents survey, 53% of teams with deployed agents already use LLM-as-judge for automated evaluation, and research shows sophisticated judge models can align with human preferences at roughly 85% agreement — higher than inter-human agreement (81%). The key implementation discipline is randomising the order of the two responses before presenting them to the judge to avoid position bias, which can flip verdicts in 10–30% of comparisons when left uncorrected.

```python
import json
import random

import anthropic

client = anthropic.Anthropic()

def judge_pair(prompt: str, prod_response: str, cand_response: str) -> dict:
    """Ask a judge model to compare two responses. Order is randomised."""
    # Shuffle which source appears first, then label by position, so the
    # judge always sees neutral labels "A" and "B" in a random order.
    candidates = [("prod", prod_response), ("candidate", cand_response)]
    random.shuffle(candidates)  # Avoid position bias
    (source_a, text_a), (source_b, text_b) = candidates

    judge_prompt = f"""
User prompt: {prompt}

Response A: {text_a}

Response B: {text_b}

Which response is better? Reply with a JSON object:
{{"winner": "A" or "B", "reason": "one sentence"}}
"""
    result = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=256,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    # Assumes the judge replies with bare JSON; json.loads raises otherwise.
    verdict = json.loads(result.content[0].text)
    # Map the positional label back to prod/candidate
    winner = source_a if verdict["winner"] == "A" else source_b
    return {"winner": winner, "reason": verdict["reason"]}
```

Human review (ground truth for high-stakes decisions)

For a final go/no-go decision on a high-traffic or safety-sensitive feature, human reviewers should evaluate a random sample of paired responses — typically 100–300 pairs is enough to detect a meaningful quality difference. Human review is slow and expensive, which is why you use it only for the final gate, after structural metrics and the LLM judge have already filtered out the clearly bad candidates.

Defining release gates before you start

Decide your pass/fail thresholds before the shadow run begins, not after you see the numbers. Typical gates:

Error rate delta <= +0.5 percentage points vs. production
p99 latency delta <= +200 ms
LLM judge win rate for candidate >= 50% (parity) or >= 55% (clear improvement)
Format compliance rate >= production rate
Cost per request delta within budget (e.g., <= +20%)
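Writing the gates down as code before the run makes them hard to renegotiate afterwards. The sketch below encodes the example thresholds from the list above; the metric key names (`error_rate`, `p99_ms`, `format_compliance`, `cost_per_request`) and the `Gates` class are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gates:
    max_error_rate_delta_pp: float = 0.5    # percentage points
    max_p99_latency_delta_ms: float = 200.0
    min_judge_win_rate: float = 0.50        # parity; use 0.55 for clear improvement
    max_cost_delta_pct: float = 20.0


def evaluate_gates(prod: dict, cand: dict, judge_win_rate: float,
                   gates: Gates = Gates()) -> dict:
    """Compare candidate metrics to production and report each gate's result."""
    cost_delta_pct = (cand["cost_per_request"] / prod["cost_per_request"] - 1) * 100
    checks = {
        "error_rate": (cand["error_rate"] - prod["error_rate"]) * 100
                      <= gates.max_error_rate_delta_pp,
        "p99_latency": cand["p99_ms"] - prod["p99_ms"]
                       <= gates.max_p99_latency_delta_ms,
        "judge_win_rate": judge_win_rate >= gates.min_judge_win_rate,
        "format_compliance": cand["format_compliance"] >= prod["format_compliance"],
        "cost": cost_delta_pct <= gates.max_cost_delta_pct,
    }
    return {"passed": all(checks.values()), "checks": checks}
```

Returning the per-gate results alongside the overall verdict makes it clear which threshold blocked a promotion.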

Shadow mode vs. canary release

Shadow mode and canary releases are often confused because both involve running two model versions simultaneously on production traffic. The difference is fundamental: in shadow mode, users never see the candidate's output. In a canary release, a small fraction of real users see the candidate's output and experience its effects.

// Shadow mode vs. canary release

Shadow mode

Candidate output logged, never served
Zero user impact even if candidate fails badly
No user behaviour signal (no clicks, ratings, or conversions)
Best for: early validation, catching crashes and regressions
Typical duration: 24-72 hours

Canary release

Candidate output served to 1-10% of real users
Real user impact if candidate is broken
Captures downstream business metrics (engagement, retention)
Best for: final pre-rollout validation after shadow passes
Typical duration: 24 hours to 2 weeks

The recommended sequencing in modern LLMOps is: offline eval → shadow mode → canary → full rollout. Each stage builds confidence before the next. Shadow mode de-risks the canary by ensuring the candidate is not fundamentally broken before any user is exposed to it. If your candidate passes shadow gates, you can run a canary at a meaningful traffic percentage (5–10%) without the fear that you are gambling on unknown behaviour.

Common pitfalls

Shadow mode looks simple in diagrams but has several practical failure modes that teams repeatedly encounter.

Doubling your inference bill

Shadow mode adds one extra inference call for every shadowed request. At a 100% sampling rate on a per-token pricing plan, your daily bill roughly doubles for the duration (more if the candidate is pricier or more verbose than production). Budget for this before you start, set billing alerts, and consider shadowing only a 10–20% sample rather than 100% of traffic. The cost is worth it, but it needs to be expected.

Treating the shadow as a true A/B test

Shadow mode is not a user-experience A/B test. Because users never see the candidate's responses, you cannot measure downstream behavioural outcomes — engagement, satisfaction scores, conversion rates — during the shadow phase. Shadow mode measures model output quality, not user response to that quality. If you need to measure user outcomes, you need a canary or an A/B test.

Non-determinism making comparison noisy

LLMs are stochastic. Even two calls to the same model with identical inputs will produce slightly different outputs. When the production model and the candidate are different versions, the difference you observe in shadow logs is a combination of the genuine quality difference plus random variation. Make sure your sample size is large enough (typically a few hundred to a few thousand requests) to let the signal emerge from the noise before drawing conclusions.
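To see why sample size matters, it helps to put a confidence interval around the judge win rate. The sketch below uses the Wilson score interval for a binomial proportion. Treating each comparison as an independent win/loss trial (with no ties) is a simplifying assumption.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a win rate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half
```

With 55 wins out of 100 comparisons the interval still includes 50%, so the result is consistent with parity. With 550 wins out of 1,000 the lower bound moves above 50%, and the same 55% win rate becomes evidence of a real difference.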

Side effects in tool-calling or agentic systems

If your LLM system calls external tools — databases, APIs, file systems — the shadow candidate must never execute those tools for real. The candidate should run with tool calling stubbed out or redirected to a sandbox environment. Calling a real payment API or writing to a production database during a shadow run is a serious incident waiting to happen. This is the single most dangerous pitfall in agentic shadow testing.
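A stubbed tool layer can be as simple as an object that records what the candidate tried to do and returns canned results. The sketch below is one possible shape. The class name, the `call` interface, and the tool names in the usage note are hypothetical, not taken from any agent framework.

```python
from typing import Any, Dict, List


class RecordingToolbox:
    """Shadow-path tool layer: records intended calls, never executes them."""

    def __init__(self, canned_results: Dict[str, Any]):
        self._canned = canned_results
        self.calls: List[dict] = []

    def call(self, name: str, **arguments: Any) -> Any:
        # Record the intent first so even unsupported tools show up in the log.
        self.calls.append({"tool": name, "arguments": arguments})
        if name not in self._canned:
            # Fail loudly rather than fall through to a real implementation.
            raise KeyError(f"no stub registered for tool {name!r}")
        return self._canned[name]
```

For example, a shadow agent given `RecordingToolbox({"send_email": {"status": "queued"}})` can "send" an email and receive a plausible result, while the only effect is an entry in `calls` that you can compare against what production actually did.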

Log store becoming a bottleneck

Shadow logging adds write volume proportional to your traffic. If the logging path is synchronous and the log store is slow, it will add latency to user-facing requests. Always write shadow logs on a separate async worker thread and decouple the logging path completely from the response path.
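One common way to decouple the two paths is a bounded in-memory queue drained by a background thread. In the sketch below, `sink` is an assumed callable that writes one JSON line to your log store. When the queue is full the writer drops the record instead of blocking the request, which is a deliberate design choice for this example.

```python
import json
import queue
import threading
from typing import Callable


class AsyncLogWriter:
    """Hands log records to a background thread so requests never wait on I/O."""

    def __init__(self, sink: Callable[[str], None], max_pending: int = 10_000):
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self._sink = sink
        self.dropped = 0  # approximate count; not synchronised across threads
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, record: dict) -> None:
        """Enqueue without blocking; drop the record if the queue is full."""
        try:
            self._queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1

    def _run(self) -> None:
        while True:
            record = self._queue.get()
            if record is None:  # sentinel from close()
                break
            try:
                self._sink(json.dumps(record))
            except Exception:
                pass  # a failing log store must not kill the worker

    def close(self) -> None:
        """Flush everything already queued, then stop the worker."""
        self._queue.put(None)
        self._thread.join()
```

Dropping a shadow record costs one data point, whereas blocking would cost user-facing latency. Monitoring `dropped` tells you when the log store is falling behind.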

Going deeper

Once you have the basics working, there are several directions worth exploring to make shadow testing more powerful.

Continuous shadow mode

Rather than running shadow mode only when you want to ship a new model, some teams keep a shadow lane permanently active and rotate candidate models through it continuously. This gives you an always-on regression signal: any future model update is automatically shadowed before promotion. The cost is a permanent overhead on inference spend, roughly equal to the sampling rate (10–20% in the setup described here) if the candidate is priced like production, but the benefit is that regressions are caught within hours rather than weeks.

Replay-based shadow testing

If you cannot run the candidate on live traffic (because you are testing before a new model is deployed at all, or because you need deterministic results), you can replay logged historical requests against the candidate. Replay testing is less realistic than live shadowing — the request distribution may have shifted since the logs were collected — but it is a useful first pass and requires no infrastructure changes to your production system.
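A replay harness can be very small. The sketch below assumes historical production calls were logged as JSON lines with `request_id`, `request_payload`, and `response_text` fields (hypothetical names), and that `candidate` is any callable that takes a payload and returns text.

```python
import json
from typing import Callable, Iterable, Iterator


def replay(log_lines: Iterable[str],
           candidate: Callable[[dict], str]) -> Iterator[dict]:
    """Re-run logged requests against the candidate and yield paired results."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log file
        logged = json.loads(line)
        try:
            text, error = candidate(logged["request_payload"]), None
        except Exception as exc:
            # Record the failure and keep replaying the remaining requests.
            text, error = None, repr(exc)
        yield {
            "request_id": logged["request_id"],
            "production_text": logged["response_text"],
            "candidate_text": text,
            "candidate_error": error,
        }
```

The paired output has the same shape as a live shadow comparison, so the same offline analysis can be reused for both.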

Platform features to know

Several platforms have first-class shadow testing support:

AWS SageMaker — the managed endpoint concept supports named shadow variants; SageMaker handles traffic mirroring and comparison dashboards natively.
Seldon Core — the open-source Kubernetes model server supports shadow deployments via its SeldonDeployment custom resource, routing mirrored traffic to a shadow predictor without serving its responses.
Envoy / Istio — the service mesh's traffic mirroring feature (via VirtualService mirror configuration) can shadow any HTTP traffic, including LLM API calls, at the infrastructure layer.
LLM gateway proxies — gateways such as Portkey and LiteLLM are driven by YAML/JSON configuration, which can let you set up shadow routing without application code changes; check each product's documentation for its current mirroring support.

Automating the promotion decision

The end goal is a fully automated shadow-to-canary pipeline. The pattern looks like this: the shadow run is triggered automatically when a new model version is registered in your model registry; a CI/CD job monitors the shadow metrics against your pre-defined release gates; if all gates pass after a minimum observation window (24–48 hours is common), the pipeline automatically promotes the candidate to a 5% canary and pages your team to review the dashboard. Human approval is still required for the final promotion to 100%, but the shadow phase is fully automated. This approach is described as a core LLMOps maturity milestone in the ZenML 2025 survey of 1,200 production deployments.

FAQ

Does shadow mode require double the API budget?

Only if you shadow 100% of traffic. Each shadowed request costs two inference calls, so shadowing a fraction of traffic (10–20% is common) adds roughly that fraction to your inference spend, assuming the candidate is priced similarly to production. The spend is temporary and bounded, since a typical shadow run lasts 24–72 hours, and it is far cheaper than a bad rollout.

Can I use shadow mode to test a completely different model provider?

Absolutely. Shadow mode is provider-agnostic. You can mirror traffic from a GPT production endpoint to a Claude or Gemini candidate, or from a cloud-hosted model to a self-hosted open-weight model. The only requirement is that both endpoints accept the same request format — if they differ, add a thin adapter layer before the candidate call.

How is shadow mode different from A/B testing?

In shadow mode, users never see the candidate's output, so you cannot measure user behaviour. A/B testing splits live traffic so that real users see one variant or the other, which lets you measure downstream outcomes like engagement or satisfaction. Shadow mode validates technical correctness and quality before any user exposure; A/B testing measures user response after you are confident the candidate is production-safe.

How long should a shadow run last?

Long enough to accumulate a statistically meaningful sample across your traffic patterns. For most services, 24–48 hours at a 10–20% sampling rate is sufficient. If your traffic has strong day-of-week variation (e.g., B2B tools that spike on weekdays), run the shadow for at least one full business cycle — typically 5–7 days — to cover the full distribution.

What happens if the candidate model errors out during the shadow run?

Nothing visible to users — that is the whole point. The error is logged, the production model's response is returned to the user as normal, and the error rate in the shadow logs becomes a data point in your go/no-go analysis. A high error rate in shadow mode is exactly the kind of regression you want to catch before promoting the candidate.

Is shadow mode safe to use with LLM agents that call external tools?

Only if you stub out or sandbox the tool calls. A shadow agent must never write to a production database, call a payment API, or send emails — it should only predict what it would do without actually doing it. Build a mock tool layer for the shadow path, or use a sandboxed environment with fake credentials.
