How to Handle 429 Errors: Retries, Backoff, and Request Queues

Build retry logic that actually works under load — exponential backoff with jitter, retry-after headers, and a client-side queue.

INTERMEDIATE · 12 MIN READ · UPDATED 2026-06-12

In plain English

HTTP status 429 Too Many Requests is the server telling you: slow down, you're sending faster than I'm willing to serve you. Every LLM provider — Anthropic, OpenAI, Google — enforces rate limits measured in requests per minute (RPM) and tokens per minute (TPM). When your code crosses those limits, the response comes back with a 429 instead of a completion.

Think of it like a busy coffee shop with a single cashier. You can't place three orders simultaneously — you have to queue. The cashier (the API) can only take so many orders per minute. If you rush the counter too aggressively, you don't get your coffee, you get turned away. The smart approach isn't to shout louder, it's to wait your turn and try again politely.

The good news: a 429 is not an error in the sense of "something is broken." It's a signal — the server is healthy, your request was valid, you just need to back off and retry. Almost every production LLM app needs a retry strategy, because even well-budgeted applications hit brief rate-limit windows during traffic spikes.

Why it matters

Naive retry logic — sleep one second, try again — fails in production for two reasons. First, if ten workers all sleep the same second and wake up together, they produce a thundering herd: a synchronized burst that trips the rate limit again immediately, creating a loop that never converges. Second, a fixed wait doesn't respect the server's own guidance on when it will be ready — so you may wait too long (wasting latency) or not long enough (getting another 429).

Getting this right matters because:

User-facing latency: a badly implemented retry loop can add 10–30s of delay to a request, turning a 2-second interaction into a timeout.
Cost amplification: if a bug causes infinite retries, you can rack up a large bill before anyone notices. Retries should be capped.
Correctness: LLM calls are not always free of side effects. Retrying a non-idempotent action (e.g., "send this email") without a guard can duplicate real-world effects.
Quota fairness: in multi-tenant apps, one user's burst shouldn't exhaust the shared API budget for everyone. Client-side queuing is the fix.
Reliability: apps that crash on the first 429 feel brittle. Apps with clean retry handling degrade gracefully and recover on their own.

How it works

The standard solution is exponential backoff with full jitter. On each failed attempt, you wait for a random time drawn from an exponentially growing window. The algorithm has four components:

Check for Retry-After. Read the retry-after (or retry-after-ms) response header. If present, it is the server's own advice on when to retry — honor it exactly. Never retry sooner than this value.
Compute the backoff window. If no Retry-After is present, use min(cap, base * 2^attempt). A common default: base = 1s, cap = 60s. After attempt 1 the window is 0–2s, after attempt 4 it is 0–16s, after attempt 6 it is capped at 0–60s.
Apply full jitter. Pick a random value uniformly from [0, window]. This spreads retries across time, collapsing the thundering herd into a smooth distribution.
Cap total attempts. Stop after 5–8 retries and surface the error. Unbounded retries can stall the event loop and obscure real bugs.

// Exponential backoff with jitter — one request lifecycle

1. Send API request (attempt N)
2. 429 received? (check status code)
3. Read Retry-After header (if present, use it; else compute window)
4. Sleep random(0, min(cap, base × 2^N)) (full jitter, no thundering herd)
5. Retry or give up (if N < max_attempts, go back to step 1)
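The window sizes quoted above can be checked with a few lines of Python. This is a minimal sketch, assuming the defaults from the text (base = 1s, cap = 60s); the helper names `backoff_window` and `next_delay` are illustrative and not part of any SDK.

```python
import random
from typing import Optional


def backoff_window(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound of the sleep window: min(cap, base * 2^attempt)."""
    return min(cap, base * (2 ** attempt))


def next_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to sleep before the next try.

    A server-provided Retry-After value (already converted to seconds)
    takes precedence; otherwise draw uniformly from [0, window].
    """
    if retry_after is not None:
        return max(0.0, float(retry_after))
    return random.uniform(0.0, backoff_window(attempt, base, cap))
```

Calling `backoff_window` with attempts 1, 4, and 6 reproduces the 2s, 16s, and capped 60s windows listed in the steps.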

What the provider headers tell you

Both Anthropic and OpenAI return rate-limit headers with every response — not just on 429s. These let you implement proactive throttling before you ever hit a limit.

Provider	Header	Meaning
Anthropic	`anthropic-ratelimit-requests-remaining`	Requests left in the current window
Anthropic	`anthropic-ratelimit-requests-reset`	ISO timestamp when the window resets
Anthropic	`anthropic-ratelimit-tokens-remaining`	Tokens left this minute (most restrictive limit)
Anthropic	`retry-after`	Seconds to wait (only on 429)
OpenAI	`x-ratelimit-remaining-requests`	Requests left in the current window
OpenAI	`x-ratelimit-reset-requests`	Time until the window resets
OpenAI	`x-ratelimit-remaining-tokens`	Tokens left this minute
OpenAI	`retry-after-ms`	Milliseconds to wait (only on 429)
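To turn the two retry headers in the table into a single wait value, something like the following may be enough. It is a sketch under stated assumptions: the header names come from the table above, the function name `retry_delay_seconds` is hypothetical, and HTTP-date values for `retry-after` are deliberately not parsed.

```python
from typing import Mapping, Optional


def retry_delay_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Return the server-suggested wait in seconds, or None if unavailable."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    # (header name, divisor that converts the value to seconds).
    # The millisecond header is checked first because it is more precise.
    candidates = (("retry-after-ms", 1000.0), ("retry-after", 1.0))
    for name, divisor in candidates:
        raw = lowered.get(name)
        if raw is None:
            continue
        try:
            return max(0.0, float(raw) / divisor)
        except ValueError:
            continue  # e.g. an HTTP-date; not handled in this sketch
    return None
```

When the function returns None, the caller would fall back to the jittered backoff window.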

Retry in code

The cleanest Python approach is the tenacity library, which wraps the full jitter algorithm behind a decorator. You don't need to hand-roll the backoff math.

Exponential backoff with full jitter using tenacity (Python)

import anthropic
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
    retry_if_exception_type,
)

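# Disable the SDK's own retries so tenacity is the only retry layer
client = anthropic.Anthropic(max_retries=0)

@retry(
    retry=retry_if_exception_type(anthropic.RateLimitError),
    wait=wait_random_exponential(min=1, max=60),  # full jitter, capped at 60s
    stop=stop_after_attempt(6),
    reraise=True,  # surface the original RateLimitError, not tenacity.RetryError
)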
def call_with_retry(prompt: str) -> str:
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

# Usage — transparent to the caller
result = call_with_retry("Summarise this document: ...")

For OpenAI the pattern is identical — swap anthropic.RateLimitError for openai.RateLimitError. The wait_random_exponential function implements the full-jitter algorithm from the AWS architecture blog: it picks a uniform random value from [0, min(cap, base × 2^attempt)], which has the best convergence properties under high contention.

When you need to respect the Retry-After header explicitly — rather than trusting tenacity's default math — you can read the header from the exception and use it as a floor:

Manual retry loop that reads Retry-After (Python)

import time
import random
import anthropic

# Disable the SDK's built-in retries so attempts are not multiplied
client = anthropic.Anthropic(max_retries=0)

def call_respecting_retry_after(
    prompt: str,
    max_attempts: int = 6,
) -> str:
    base_delay = 1.0
    cap = 60.0

    for attempt in range(max_attempts):
        try:
            msg = client.messages.create(
                model="claude-opus-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return msg.content[0].text

        except anthropic.RateLimitError as exc:
            if attempt == max_attempts - 1:
                raise  # exhausted, bubble up

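            # Respect Retry-After if the server provided it. The SDK exposes
            # the raw HTTP response on the exception, so read the header there.
            retry_after = exc.response.headers.get("retry-after")
            try:
                wait = float(retry_after)  # seconds
            except (TypeError, ValueError):
                # Header missing or not a plain number: fall back to backoff
                window = min(cap, base_delay * (2 ** attempt))
                wait = random.uniform(0, window)  # full jitter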

            print(f"429 on attempt {attempt + 1}; sleeping {wait:.1f}s")
            time.sleep(wait)

    raise RuntimeError("Unreachable")

The client-side request queue

Retry logic handles the aftermath of hitting a 429. A client-side request queue prevents you from hitting it in the first place. The idea is simple: instead of firing every API call immediately, route them through a queue that enforces the provider's own rate limits on your side.

For LLM APIs you need two parallel buckets — one for RPM and one for TPM — because providers enforce both independently. A request that fits the RPM budget may still be blocked if it exceeds the TPM budget:

// Client-side dual-bucket queue

1. Incoming requests (N concurrent callers)
2. FIFO queue (buffer until slots are available)
3. RPM bucket check (≤ limit requests/min?)
4. TPM bucket check (≤ limit tokens/min?)
5. Send to API (both buckets had capacity)

The TypeScript example below uses a simple sliding-window approach that works in a Node.js app or serverless function. For Python async code you can adapt the same pattern with asyncio.Queue.

Simple RPM-limited request queue (TypeScript)

interface QueuedRequest<T> {
  fn: () => Promise<T>;
  resolve: (value: T) => void;
  reject: (error: unknown) => void;
}

export class RateLimitedQueue {
  private queue: QueuedRequest<unknown>[] = [];
  private timestamps: number[] = []; // rolling window of request times
  private readonly rpm: number;
  private readonly windowMs = 60_000;
  private running = false;

  constructor(rpm: number) {
    this.rpm = rpm;
  }

  enqueue<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push({ fn, resolve, reject } as QueuedRequest<unknown>);
      if (!this.running) this.drain();
    });
  }

  private async drain() {
    this.running = true;
    while (this.queue.length > 0) {
      const now = Date.now();
      // Purge timestamps older than 60s
      this.timestamps = this.timestamps.filter(t => now - t < this.windowMs);

      if (this.timestamps.length >= this.rpm) {
        // Sleep until the oldest timestamp falls out of the window
        const wait = this.windowMs - (now - this.timestamps[0]);
        await new Promise(r => setTimeout(r, wait + 50)); // +50ms margin
        continue;
      }

      const item = this.queue.shift()!;
      this.timestamps.push(Date.now());
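      // Start fn inside a promise chain so a synchronous throw rejects the
      // caller instead of escaping drain() and leaving `running` stuck at true.
      Promise.resolve().then(item.fn).then(item.resolve, item.reject);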
    }
    this.running = false;
  }
}

// Usage
const queue = new RateLimitedQueue(50); // 50 RPM (Tier 1 Anthropic)

async function ask(prompt: string) {
  return queue.enqueue(() =>
    anthropicClient.messages.create({ ... })
  );
}
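The TypeScript queue above tracks only RPM. The dual-bucket check described earlier could be sketched in Python roughly as follows. This is an illustration under assumptions, not a provider-accurate limiter: the class name `DualWindowLimiter` is hypothetical, token counts are whatever estimate the caller supplies, and the current time is passed in explicitly so the logic is easy to test.

```python
from collections import deque


class DualWindowLimiter:
    """Sliding window over both request count (RPM) and token count (TPM)."""

    def __init__(self, rpm: int, tpm: int, window_s: float = 60.0):
        if rpm < 1 or tpm < 1:
            raise ValueError("rpm and tpm must be at least 1")
        self.rpm = rpm
        self.tpm = tpm
        self.window_s = window_s
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._tokens_in_window = 0

    def _purge(self, now: float) -> None:
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_s:
            _, tokens = self._events.popleft()
            self._tokens_in_window -= tokens

    def try_acquire(self, tokens: int, now: float) -> float:
        """Admit the request and return 0.0, or return seconds to wait.

        After waiting, the caller should call try_acquire again, because
        one expired entry may not free enough token budget.
        """
        if tokens > self.tpm:
            raise ValueError("request is larger than the whole TPM budget")
        self._purge(now)
        fits_rpm = len(self._events) < self.rpm
        fits_tpm = self._tokens_in_window + tokens <= self.tpm
        if fits_rpm and fits_tpm:
            self._events.append((now, tokens))
            self._tokens_in_window += tokens
            return 0.0
        # Wait until the oldest recorded request leaves the window.
        return self.window_s - (now - self._events[0][0])
```

In an async worker, the returned wait would typically feed `asyncio.sleep` before retrying `try_acquire` with a fresh `time.monotonic()` reading.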

For large-scale multi-service architectures, a centralized rate-limit proxy is the natural next step: a single internal service that all your microservices funnel through. It holds the global RPM and TPM counters, queues or rejects excess requests, and distributes the API budget fairly across consumers. Open-source options like LiteLLM Proxy provide this out of the box.

Going deeper

Proactive throttling from response headers

The cleanest production systems never hit a 429 wall at all. They read the x-ratelimit-remaining-* (OpenAI) or anthropic-ratelimit-*-remaining (Anthropic) headers on every successful response and slow down proactively when remaining budget falls below a threshold — typically 20% of the per-minute limit. This eliminates retry jitter latency entirely for well-behaved traffic.
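One way to act on the remaining-budget headers is to compute a small pause before the next request. The sketch below assumes you have already parsed the remaining count and the time until reset from the headers, and that you know your per-minute limit from your tier. The function name `proactive_pause` and the pacing rule (spread the remaining calls evenly until the reset) are assumptions for illustration, not provider guidance.

```python
def proactive_pause(
    remaining: int,
    limit: int,
    seconds_until_reset: float,
    threshold: float = 0.2,
) -> float:
    """Seconds to pause before the next request.

    Returns 0.0 while at least `threshold` of the budget is left. Below
    that, the remaining calls are spaced evenly across the time left in
    the window; with nothing left, wait for the reset.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    time_left = max(0.0, seconds_until_reset)
    if remaining / limit >= threshold:
        return 0.0
    if remaining <= 0:
        return time_left
    return time_left / (remaining + 1)
```

With 9 of 100 requests left and 30 seconds to the reset, this yields a 3-second pause between calls.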

Idempotency and side effects

Pure completions ("summarise this text") are safe to retry — the same input always produces an equivalent output. But if your LLM call is part of a larger action — writing to a database, sending an email, posting a webhook — you must guard against double execution. Use an idempotency key or a distributed lock that marks the action as started before the API call. If the process crashes mid-retry, the lock prevents a second execution on restart.
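The "mark as started before acting" idea can be sketched with an in-memory guard. This is only an illustration: a real deployment would need a durable store (a database row or distributed lock) so the marker survives a restart, and the names `IdempotencyGuard` and `run_once` are hypothetical.

```python
import threading
from typing import Callable


class IdempotencyGuard:
    """In-memory stand-in for a durable idempotency-key store."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._started = set()

    def claim(self, key: str) -> bool:
        """Return True exactly once per key; later calls return False."""
        with self._lock:
            if key in self._started:
                return False
            self._started.add(key)
            return True


def run_once(guard: IdempotencyGuard, key: str, action: Callable[[], None]) -> bool:
    """Run the side-effecting action only if this key was never claimed."""
    if not guard.claim(key):
        return False  # a previous attempt already started this action
    action()
    return True
```

The retryable LLM call would sit outside `run_once`; only the side effect (the email, the database write) is wrapped.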

Circuit breaker pattern

Exponential backoff handles short rate-limit events. For a sustained outage — where the provider's limits are hit continuously for minutes — a circuit breaker is more appropriate. The circuit trips open after N consecutive 429s, returns a cached or degraded response immediately for the next T seconds, then allows a single probe request through. If the probe succeeds, the circuit closes. If it fails, the timeout resets. Libraries like pybreaker (Python) or opossum (Node.js) implement this.
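The state machine described above is small enough to sketch by hand, which may help before adopting a library. The version below is a simplified, single-threaded illustration; the class name, parameter names, and injectable clock are assumptions of this sketch and do not mirror the pybreaker or opossum APIs.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        open_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed
        self._probe_in_flight = False

    def allow_request(self) -> bool:
        """True if a call may go out; False means serve a fallback."""
        if self._opened_at is None:
            return True
        timed_out = self._clock() - self._opened_at >= self.open_seconds
        if timed_out and not self._probe_in_flight:
            self._probe_in_flight = True  # let exactly one probe through
            return True
        return False

    def record_success(self) -> None:
        # A successful call (including a probe) closes the circuit.
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        # Count a 429; trip open on the Nth in a row or on a failed probe.
        self._failures += 1
        if self._probe_in_flight or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)start the open timer
            self._probe_in_flight = False
```

Callers check `allow_request()` before each API call and report the outcome with `record_success()` or `record_failure()`.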

Tier upgrades vs. architectural fixes

If you are consistently hitting rate limits despite correct retry logic, you have two levers: upgrade your tier (Anthropic and OpenAI both have spend-based tier progression that unlocks higher RPM/TPM), or reduce your per-request token footprint through prompt compression, caching, or routing simpler tasks to a cheaper and higher-throughput model. Tier upgrades cost money; architectural fixes often cost less long-term.

Full jitter vs. equal jitter vs. decorrelated jitter

The AWS Architecture Blog's canonical comparison ("Exponential Backoff And Jitter") found that full jitter, sleep = random(0, window) with window = min(cap, base × 2^attempt), produces the best throughput under high contention because it spreads load most evenly. Equal jitter, sleep = window/2 + random(0, window/2), guarantees a minimum wait, useful if you want to avoid near-zero retries. Decorrelated jitter, sleep = min(cap, random(base, previous_sleep × 3)), can produce higher average waits and is generally not preferred for LLM retries. For most applications, full jitter with wait_random_exponential from tenacity is the correct choice.
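The three variants are easier to compare side by side in code. This is a minimal sketch, assuming the formulas from the AWS post with window = min(cap, base × 2^attempt); the function names and the optional `rng` parameter (any object with a `uniform` method, defaulting to the `random` module) are choices made for this example.

```python
import random


def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Uniform draw from [0, window]."""
    window = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, window)


def equal_jitter(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Half the window is fixed, the other half is random."""
    window = min(cap, base * (2 ** attempt))
    return window / 2 + rng.uniform(0.0, window / 2)


def decorrelated_jitter(
    previous_sleep: float, base: float = 1.0, cap: float = 60.0, rng=random
) -> float:
    """Next sleep depends on the previous sleep, not the attempt number."""
    return min(cap, rng.uniform(base, previous_sleep * 3))
```

Note that decorrelated jitter needs the previous sleep as state (start it at `base`), whereas the other two only need the attempt counter.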

FAQ

What is the difference between 429 Too Many Requests and 503 Service Unavailable?

A 429 means the server is healthy but your client is sending too fast — a rate limit you can back off from. A 503 means the server is temporarily unable to handle requests at all (overloaded or down). Both warrant retries, but 503 usually needs a longer wait and may indicate a provider incident worth checking a status page for.

Should I retry 429s on all LLM providers the same way?

The algorithm is identical, but the headers differ. Always check for a retry-after or retry-after-ms header first and honor it. OpenAI uses retry-after-ms (milliseconds) on 429 responses; Anthropic uses retry-after (seconds). If neither is present, fall back to exponential backoff with full jitter.

How do I avoid 429 errors in a batch job that processes thousands of documents?

Use a client-side rate-limited queue that enforces the provider's RPM and TPM limits before sending each request. Pre-estimate token counts with tiktoken (OpenAI) or client.messages.count_tokens() (Anthropic) to stay within the TPM limit. For very large batch jobs, use the provider's batch API (OpenAI Batch, Anthropic's Message Batches API) which has higher throughput and lower cost at the expense of async delivery.

Is full jitter better than exponential backoff without jitter?

Yes, significantly so when multiple clients share the same API key or when you have concurrent workers. Without jitter, all workers sleep the same duration and wake up simultaneously, creating a synchronized burst (thundering herd) that immediately trips the rate limit again. Full jitter scatters wake-up times across the window so the combined load is spread evenly.

Can I use the OpenAI or Anthropic SDK's built-in retry instead of writing my own?

For simple cases, yes. Both SDKs accept a max_retries constructor argument and implement backoff internally. However, the SDK does not expose the queue-based proactive throttling you need for high-concurrency apps, and it does not give you control over what happens when retries are exhausted. For anything beyond a prototype, wrap the SDK call in your own retry + queue logic.

What is the thundering herd problem and why does it matter for retries?

The thundering herd is what happens when many clients hit a rate limit at the same moment, all sleep the same fixed interval, and then all retry at exactly the same moment — creating another synchronized burst that triggers another 429. Adding random jitter to the sleep duration breaks the synchronization so retries arrive as a smooth stream rather than a spike.
