LLM Provider Outages: Fallbacks and Multi-Provider Failover

Learn to design fallbacks and multi-provider failover so a provider outage degrades your app instead of killing it.

Intermediate | 11 min read | Updated 2026-06-12

In plain English

LLM provider failover is the practice of automatically routing your API calls to a backup provider — or a backup model — the moment your primary one starts returning errors, timeouts, or rate-limit responses. Instead of your app showing a blank screen (or crashing), it quietly tries the next option on your list and returns an answer.


Think of it like a hospital's power supply. The hospital is wired to the grid, but there's a diesel generator in the basement. When the grid goes dark, the transfer switch trips in milliseconds and surgery keeps going. No one in the operating room noticed. Your failover stack is that transfer switch: it sits in the request path, watches for the grid to fail, and silently swaps the source of power.

The key insight is that failover and fallback are not the same thing, even though people use the words interchangeably. Failover usually means swapping infrastructure when the provider is completely unreachable (HTTP 5xx, connection timeout). Fallback is broader: it also handles semantic failures — a 429 rate-limit, a context-window overflow, or a content-policy block — where the provider is technically up but can't serve this request. A robust production setup handles both.

Why it matters

LLM providers go down more than traditional cloud infrastructure — and they go down globally. When an Azure region fails, Azure failover is geographic. When OpenAI's routing layer has a bug, it can take down every region at once. There is no automatic geographic failover built into the provider's API endpoint. Your only defense is to have a different provider ready.

Real outage numbers

The numbers are sobering. OpenAI reported nine distinct outages in a single quarter of 2024. Anthropic's Claude API measured roughly 99.32% uptime over a 30-day window — which sounds high until you realize it means nearly five hours of downtime per month. A multi-hour global outage hit OpenAI in June 2025; a routing-misconfiguration event followed in December 2025. Anthropic, Google Gemini, and others have had comparable incidents. These are not edge cases — they are the normal cadence of a rapidly scaling industry.
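The "nearly five hours" figure follows directly from the uptime percentage. As a quick check, here is a minimal sketch of the arithmetic; the helper name `monthly_downtime_hours` is made up for illustration.

```python
def monthly_downtime_hours(uptime_percent: float, days: int = 30) -> float:
    """Convert an uptime percentage into hours of downtime over `days` days."""
    total_hours = days * 24  # 720 hours in a 30-day window
    return (1 - uptime_percent / 100) * total_hours


print(round(monthly_downtime_hours(99.32), 2))  # 4.9
```

The same function shows why each extra "nine" matters: 99.9% uptime over the same window is well under one hour of downtime.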

For a user-facing product, five hours of monthly downtime is a support-ticket flood, a churn event, and potentially an SLA breach. For an autonomous agent running overnight jobs, a single-provider hard failure means the job simply doesn't complete. The cost of not having failover is visible in your on-call rotation and your renewal conversations.

What breaks without it

User-facing features go dark during provider incidents, even when your own infrastructure is healthy.
Async pipelines stall — batch summarization, nightly evals, scheduled agents — because there's no fallback path when the primary times out.
Rate limits cascade — a burst of traffic exhausts your quota on one provider and every subsequent request fails, even though other providers have capacity.
Context-window errors kill requests silently — a prompt that just crossed 128k tokens errors out with no automatic retry on a model with a larger window.

How it works

A complete failover stack has three layers working together: retries (try the same provider again), fallbacks (try a different provider), and circuit breakers (stop trying a broken provider before it drags down your whole request queue). Each layer targets a different failure mode.

// Request path through a failover stack

Incoming request: your app code
LLM Gateway / Router: normalizes provider APIs
Retry with jitter: same provider, 2-3 attempts
Circuit breaker check: is provider circuit open?
Fallback chain: next provider in priority list
Response returned: success or graceful error

Layer 1 — Retries with exponential backoff and jitter

Transient errors — a momentary 503, a brief network blip — often resolve in seconds. A retry with a short delay catches these without switching providers at all. The standard pattern is exponential backoff: wait 1s, then 2s, then 4s between attempts, capped at some maximum. The critical addition is jitter — randomizing the wait by ±30-50%. Without jitter, all the clients that hit the same outage retry at exactly the same intervals, creating a thundering-herd wave that hammers the just-recovered provider and triggers a second outage. Jitter desynchronizes the wave.

python

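import random
import time

import httpx

def call_with_backoff(client, payload, max_retries=3):
    """POST with exponential backoff and jitter; re-raises once retries run out."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            response = client.post("/chat/completions", json=payload, timeout=30)
            response.raise_for_status()  # httpx raises HTTPStatusError only when asked
            return response
        except (httpx.TransportError, httpx.HTTPStatusError) as exc:
            # 4xx responses (including 429) will not be fixed by retrying here.
            client_error = isinstance(exc, httpx.HTTPStatusError) and exc.response.status_code < 500
            if client_error or attempt == max_retries - 1:
                raise
            jitter = random.uniform(0.5, 1.5)  # randomize the wait by +/-50%
            time.sleep(delay * jitter)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...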

Layer 2 — Fallback chains

Once retries are exhausted, the gateway walks a fallback chain — an ordered list of providers ranked by preference. A typical chain for a GPT-5.5 primary might be: GPT-5.5 → Claude Sonnet 4.6 → Gemini 3 Pro → a self-hosted open model. The first healthy provider in the chain gets the request. LiteLLM, for example, lets you configure typed fallbacks so that a rate-limit error (429) routes to a different backup than a context-window error, giving you fine-grained control without custom code.
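To make the chain-walking idea concrete, here is a minimal sketch of a fallback loop. It is not how LiteLLM or any particular gateway implements it; `call_with_fallbacks`, `AllProvidersFailed`, and the two stub providers are hypothetical names, and each provider is assumed to be a plain function that takes a prompt and returns text.

```python
from typing import Callable, Dict, List, Tuple

Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def call_with_fallbacks(prompt: str, chain: List[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, answer) from the first success."""
    errors: Dict[str, Exception] = {}
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real gateway would catch narrower error types
            errors[name] = exc
    raise AllProvidersFailed(errors)


# Hypothetical stand-ins for real provider clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def healthy_backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"


name, answer = call_with_fallbacks(
    "ping", [("primary", flaky_primary), ("backup", healthy_backup)]
)
print(name, answer)  # backup backup answer to: ping
```

Returning the provider name alongside the answer makes it easy to tag traces with whichever provider actually served the request.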

Layer 3 — Circuit breakers

Retrying a provider that is fully down is expensive: each attempt burns latency and ties up requests that are waiting in the queue behind it. A circuit breaker short-circuits this by tracking the error rate of each provider in a sliding window. When errors exceed a threshold (commonly 50% over the last 10 requests), the circuit opens and the provider is removed from the rotation for a cooldown period, typically 30-60 seconds. After the cooldown, the circuit enters a half-open state: one probe request is let through. If it succeeds, the circuit closes and full traffic resumes. If it fails, the cooldown restarts.

// Circuit breaker states

Closed: all requests pass through; failures tracked
Open: provider bypassed; requests go to fallback
Half-Open: one probe request let through
Closed: probe succeeded; normal traffic resumes (the cycle repeats)
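The state machine above can be sketched in a few dozen lines. This is an illustrative, single-threaded version using the example numbers from the text (a 10-request window, a 50% threshold, a 30-second cooldown); the class and method names are hypothetical, and a production breaker would also need locking.

```python
import time
from collections import deque


class CircuitBreaker:
    """Sliding-window circuit breaker with closed, open, and half-open states."""

    def __init__(self, window=10, threshold=0.5, cooldown=30.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True marks a failed request
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.opened_at = None  # None means the circuit is closed
        self.probe_in_flight = False

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: everything passes
        if self.clock() - self.opened_at < self.cooldown:
            return False  # open: still cooling down
        if self.probe_in_flight:
            return False  # half-open: only one probe at a time
        self.probe_in_flight = True
        return True

    def record_success(self) -> None:
        if self.opened_at is None:
            self.results.append(False)
        elif self.probe_in_flight:
            # Probe succeeded: close the circuit and start a fresh window.
            self.opened_at = None
            self.probe_in_flight = False
            self.results.clear()

    def record_failure(self) -> None:
        if self.opened_at is not None:
            if self.probe_in_flight:
                # Probe failed: restart the cooldown.
                self.opened_at = self.clock()
                self.probe_in_flight = False
            return
        self.results.append(True)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) > self.threshold:
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each call to a provider and report the outcome with `record_success()` or `record_failure()`; when the check returns `False`, the request goes straight to the next entry in the fallback chain.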

Types of fallback triggers

Not all failures should trigger the same response. Routing every error type to the same backup wastes capacity and can mask real problems. A mature failover stack distinguishes at least four failure classes:

Failure type	HTTP signal	Recommended response
Provider down / server error	5xx, connection timeout	Retry 1-2x, then failover to next provider
Rate limit exhausted	429 Too Many Requests	Skip retries, immediately failover (retrying burns quota)
Context window exceeded	400 with context error	Fallback to a model with a larger context window
Content policy block	400 with policy error	Fallback to a provider with different guardrails, or return a structured error
Latency SLO breach	200 but slow	Hedge or failover if P99 latency exceeds threshold

LiteLLM exposes this directly in its config: fallbacks (all errors), context_window_fallbacks (token-limit errors), and content_policy_fallbacks (policy violations). Portkey and other gateways offer similar typed routing. Never use a single catch-all fallback — a 429 and a 503 are completely different problems and the right secondary target may differ.
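If you are routing by hand rather than through a gateway, the table above boils down to a small classification step. The sketch below is one possible shape, not a gateway API: `Action` and `classify_failure` are hypothetical names, and the `error_code` strings are illustrative assumptions, since each provider labels context and policy errors differently.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_THEN_FAILOVER = "retry 1-2x, then next provider"
    FAILOVER_NOW = "skip retries, next provider"
    LARGER_CONTEXT_MODEL = "model with a larger context window"
    POLICY_FALLBACK = "provider with different guardrails, or structured error"
    RAISE = "not a failover case; surface the error"


def classify_failure(status_code: Optional[int], error_code: str = "") -> Action:
    """Map an HTTP status (None for a connection timeout) to a routing action."""
    if status_code is None or status_code >= 500:
        return Action.RETRY_THEN_FAILOVER
    if status_code == 429:
        return Action.FAILOVER_NOW
    # The error_code values below are assumptions; check your provider's error schema.
    if status_code == 400 and error_code == "context_length_exceeded":
        return Action.LARGER_CONTEXT_MODEL
    if status_code == 400 and error_code == "content_policy_violation":
        return Action.POLICY_FALLBACK
    return Action.RAISE


print(classify_failure(429).name)  # FAILOVER_NOW
```

Anything that falls through to `Action.RAISE` (a malformed request, a bad API key) is a bug in the caller, and failing over would only hide it.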

Maintaining quality across providers

The engineering challenge with failover is not availability — it's output consistency. The same prompt sent to GPT-5.5, Claude Sonnet 4.6, and Gemini 3 Pro will produce subtly different answers. For casual chat, this is fine. For a workflow that extracts structured JSON, classifies intent into a fixed taxonomy, or must follow a particular voice, the differences can matter.

Strategies for consistent output

Test your fallback models explicitly. Run your eval suite against every provider in your chain before going to production. Discover the gaps before your users do.
Use provider-specific prompt variants. The same instruction phrased differently can close a quality gap. A thin adapter layer can rewrite prompts per provider.
Validate structured outputs on the way out. If your pipeline needs a JSON object, parse and validate the response from the fallback model before passing it downstream — don't assume format parity.
Log which provider served each request. Tag your traces with the provider used so you can correlate quality regressions to failover events in your observability tool.
Set graceful-degradation expectations. Some flows can degrade gracefully — a summary with slightly different wording is fine. Others cannot — a tool-call output with a missing field breaks a downstream step. Know which is which before you're in an incident.
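Two of these strategies, validating structured output and logging the serving provider, fit naturally into one small function at the response boundary. The sketch below assumes a hypothetical schema with `intent` and `confidence` keys; `REQUIRED_KEYS`, `parse_structured`, and the `_provider` tag are illustrative names only.

```python
import json

REQUIRED_KEYS = {"intent", "confidence"}  # hypothetical schema for illustration


def parse_structured(raw: str, provider: str) -> dict:
    """Parse a model response as JSON, check required keys, and tag the provider."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{provider} returned invalid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError(f"{provider} returned JSON that is not an object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"{provider} response is missing keys: {sorted(missing)}")
    data["_provider"] = provider  # keep the serving provider with the payload for tracing
    return data


result = parse_structured('{"intent": "refund", "confidence": 0.93}', "backup-model")
print(result["_provider"])  # backup-model
```

Raising `ValueError` here lets the caller treat a malformed fallback response like any other failure and move on to the next provider in the chain.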

Practical tools and configuration

Retry, fallback, and circuit-breaker logic is a solved problem, so use a gateway or library rather than building it from scratch. Here are the leading options in 2025-2026:

Tool	Type	Key strength
LiteLLM	Open-source proxy + Python library	Typed fallbacks, 100+ provider support, YAML config
Portkey	Managed gateway + open-source core	Circuit breakers on P99 latency, built-in observability
Requesty	Managed gateway	Automatic health monitoring, zero-config failover
Tenacity (Python)	Retry library	Flexible retry decorators with jitter; combine with any HTTP client

LiteLLM fallback config (YAML)

The following LiteLLM proxy config routes GPT-5.5 as primary, falls back to Claude Sonnet 4.6 on any error, and routes to a different Claude model if the prompt overflows the context window (which only helps when that model's window is larger):

yaml

model_list:
  - model_name: gpt-5.5
    litellm_params:
      model: openai/gpt-5.5
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-8
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  num_retries: 2
  request_timeout: 30
  fallbacks:
    - gpt-5.5: ["claude-sonnet"]
  context_window_fallbacks:
    - gpt-5.5: ["claude-opus"]

The proxy normalizes both providers to the OpenAI chat-completions schema, so your application code makes a single API call and never knows which provider served the response.

Going deeper

Once the basics are solid, three advanced patterns push your resilience further: hedged requests, load balancing with weighted failover, and semantic health checks.

Hedged requests

A hedged request fires the same prompt at two providers simultaneously and returns whichever responds first, cancelling the other. This is the most aggressive latency-tail strategy — it nearly eliminates P99 spikes — but it doubles your token spend. Use it only for latency-critical, user-facing interactions where the cost trade-off makes sense, not for batch workloads.
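A minimal sketch of hedging with `asyncio` follows. It assumes each provider is an async function that takes a prompt and returns text; `hedged_request` and the two fake providers are hypothetical, with `asyncio.sleep` standing in for network latency.

```python
import asyncio


async def hedged_request(prompt, providers):
    """Send the prompt to every provider at once and return the first successful answer."""
    tasks = [asyncio.create_task(provider(prompt)) for provider in providers]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception:
                continue  # this provider failed; wait for the next one to finish
        raise RuntimeError("all hedged providers failed")
    finally:
        for task in tasks:
            task.cancel()  # no effect on tasks that have already finished


# Hypothetical providers with simulated latency.
async def slow_provider(prompt):
    await asyncio.sleep(0.2)
    return "slow"


async def fast_provider(prompt):
    await asyncio.sleep(0.01)
    return "fast"


print(asyncio.run(hedged_request("hi", [slow_provider, fast_provider])))  # fast
```

Cancelling the losing task stops the client from waiting on it, but whether the provider still bills for tokens already generated depends on the provider, which is part of why hedging is costly.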

Weighted load balancing with failover exclusion

Rather than a strict primary/backup ordering, you can distribute traffic across providers by weight (say, 70% OpenAI, 20% Anthropic, 10% Google) and exclude any provider that trips the circuit breaker from the weight pool until it recovers. LiteLLM's router supports a related setup through its routing_strategy setting (for example latency-based-routing, which favors the fastest deployment rather than fixed weights) combined with fallbacks. This spreads rate-limit risk and keeps all providers warm so a failover event doesn't send cold traffic to a rarely-used backup.
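The weighted pick itself is simple to sketch by hand. The example below is not LiteLLM's router; `pick_provider` is a hypothetical helper, and the set of unhealthy providers is assumed to come from whatever circuit breaker you run.

```python
import random
from typing import Dict, Set


def pick_provider(weights: Dict[str, float], unhealthy: Set[str], rng=random) -> str:
    """Pick a provider at random in proportion to its weight, skipping unhealthy ones."""
    healthy = {name: w for name, w in weights.items() if name not in unhealthy}
    if not healthy:
        raise RuntimeError("no healthy providers left in the pool")
    names = list(healthy)
    return rng.choices(names, weights=[healthy[n] for n in names], k=1)[0]


weights = {"openai": 70, "anthropic": 20, "google": 10}
# With OpenAI's circuit open, traffic is split 2:1 between the remaining two.
print(pick_provider(weights, unhealthy={"openai"}))
```

Because `random.choices` only needs relative weights, removing a provider automatically rescales the rest; no renormalization step is required.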

Semantic health checks

Standard circuit breakers trigger on HTTP error codes. But an LLM can return HTTP 200 with a malformed or empty response body — technically not an error, but functionally broken. A semantic health check parses the response and validates it before declaring success: does it contain the expected JSON keys? Is the response non-empty? Does it pass a basic coherence check? This prevents a quietly-degraded provider from staying in the rotation just because it keeps the TCP connection alive.

Multi-region deployment of open models as a last-resort fallback

For teams with strict uptime SLAs, a self-hosted open model (Llama 4, Mistral, or similar via Ollama or vLLM) on your own infrastructure can serve as the final entry in the fallback chain: the one whose availability depends on your own operations rather than on a third-party provider. Its quality may be lower than the frontier providers, but it keeps your app alive during a simultaneous multi-provider outage. Treat it as a last-resort catch-all, not a primary.

FAQ

What is the difference between LLM failover and LLM fallback?

Failover typically refers to infrastructure-level switching when a provider is completely unreachable (5xx errors, connection timeouts). Fallback is broader: it includes semantic failures like rate limits (429), context-window overflows, and content-policy blocks. A robust setup handles both — retrying transient errors, circuit-breaking hard failures, and routing specific error classes to the right backup model.

How often do LLM providers actually go down?

More often than most engineers expect. OpenAI reported nine outages in a single quarter in 2024. Anthropic's Claude API measured roughly 99.32% uptime over a 30-day window — nearly five hours of downtime per month. Major incidents affected OpenAI in June 2025 and December 2025. Planning for at least one multi-hour outage per month per provider is a realistic baseline.

Will my users notice if the fallback model is a different provider?

Possibly, depending on the use case. For open-ended chat, output variation is usually invisible. For structured outputs (JSON extraction, classification), the fallback model may format things differently, so validate responses at the boundary. The best practice is to test your fallback models against your eval suite before going to production and to log which provider served each request for post-incident analysis.

What is exponential backoff with jitter and why does it matter?

Exponential backoff means doubling the wait between retries: 1s, then 2s, then 4s. Jitter adds randomness (±30-50%) to that delay. Without jitter, all clients that hit the same outage retry at exactly the same moments, creating a thundering-herd wave that can cause a second outage on the just-recovered provider. Jitter desynchronizes the retries, which makes it one of the most important retry fixes in high-traffic LLM systems.

Do I need to build failover logic myself?

No. Tools like LiteLLM (open-source) and Portkey (managed gateway) implement the full retry/fallback/circuit-breaker stack as configuration. Your application code makes a single API call to the gateway; the gateway handles provider selection, retries, and failover transparently. Building this from scratch is only worth it if your routing logic is highly custom.

Should I use a different backup model for rate limits versus server errors?

Yes. A 429 rate-limit means the provider is up but your quota is exhausted — retrying the same provider wastes time. A 5xx means the provider may recover in seconds — a short retry is worth trying first. LiteLLM's typed fallbacks (fallbacks for general errors, context_window_fallbacks, content_policy_fallbacks) let you configure different backup targets per failure class.
