In plain English
Every LLM API call can fail in a surprising number of ways — and most of them look identical at first glance: your code gets back something unexpected and the output is wrong, missing, or absent entirely. The good news is that there are really only two categories of failure: the HTTP layer (the request never completed, or completed with an error status code) and the model layer (the request completed fine, but the model stopped early or declined to answer).
Think of it like a restaurant order. The first category is the waiter failing: the kitchen is on fire (503 Service Unavailable), your reservation expired (401 Unauthorized), or you ordered something that isn't on the menu (400 Bad Request). The second category is the chef stopping mid-dish: they ran out of the special ingredient (max_tokens cutoff), the recipe calls for a sauce they won't make (content_filter refusal), or the dish is simply finished (stop / end_turn).
Understanding which category you are in is the first diagnostic step, because the remedies are completely different. A 503 calls for a retry with exponential backoff. A finish_reason: length calls for raising your token budget or splitting the task. A content refusal calls for rephrasing the prompt. Retrying a refused or malformed request is wasted money.
Why it matters
Silent failures are the most dangerous bugs in LLM applications. If your summarizer truncates a legal document mid-sentence and returns 200 OK, nothing throws an exception. The user sees a plausible but incomplete summary. The field finish_reason (OpenAI) or stop_reason (Anthropic) is the only signal that the output was cut off — and most beginner code ignores it entirely.
Rate-limit errors (429) and server errors (503) come up constantly in production. LLM providers throttle requests aggressively to protect capacity. Without a retry strategy, a spike of parallel calls will cause cascading failures across your app. With the right backoff logic, the same spike barely registers as a blip.
Content refusals are a third source of unexpected behaviour. A model will sometimes decline to answer a perfectly reasonable question — a false positive from safety classifiers, a topic that triggers a guardrail, or a prompt phrasing that looks adversarial. Knowing the exact signal for a refusal (versus a truncation or a network timeout) means you can route around it intelligently rather than silently returning an empty response to the user.
- Production reliability — unhandled
429and503errors are the most common reason LLM-powered features go down under load. - Data integrity —
finish_reason: lengthis a silent truncation; checking it prevents incomplete structured output from being parsed and acted on. - User experience — distinguishing a transient server error (retry-able) from a content refusal (needs rephrasing) lets you show the right message instead of a generic 'something went wrong.'
- Cost control — retrying non-retryable errors (400, 401, 403) burns tokens and money with zero chance of success.
How the error layers work
An LLM API call passes through two distinct checkpoints before you see output. Understanding each checkpoint and its failure modes is the foundation of reliable error handling.
HTTP status codes — the first checkpoint
| Status | Name | Cause | Action |
|---|---|---|---|
| 400 | Bad Request | Malformed JSON, missing required field, invalid model name, message too long for context window | Fix the request. Do NOT retry — the same request will always fail. |
| 401 | Unauthorized | Missing or invalid API key | Check your key. Rotate it if compromised. Do NOT retry. |
| 403 | Forbidden | Key is valid but lacks permission for this model or feature | Check your account tier or organisation policy. Do NOT retry. |
| 429 | Too Many Requests | Rate limit hit: requests per minute, tokens per minute, or daily quota | Wait and retry with exponential backoff. Respect the Retry-After header. |
| 500 | Internal Server Error | Provider-side bug or unexpected model error | Retry with backoff (1–3 attempts). If persistent, check the provider status page. |
| 503 | Service Unavailable | Provider overloaded or in maintenance | Retry with backoff. Consider falling back to a different model or provider. |
| 504 | Gateway Timeout | Request took longer than the provider's gateway timeout | Retry. For large inputs, try chunking to reduce generation time. |
Stop reasons / finish reasons — the second checkpoint
When the HTTP call succeeds (200 OK), the response body includes a field that tells you why the model stopped generating. OpenAI's Chat Completions API calls this finish_reason; Anthropic's Messages API calls it stop_reason. Different names, same concept.
| OpenAI finish_reason | Anthropic stop_reason | Meaning | Action |
|---|---|---|---|
| stop | end_turn | Model finished naturally — this is the happy path | None needed. Output is complete. |
| length | max_tokens | Hit the max token limit before finishing | Raise max_tokens, split the task, or use continuation prompting. |
| content_filter | refusal (Claude 4+) | Safety classifier blocked the output | Rephrase the prompt, remove context that looks adversarial, or reset conversation. |
| tool_calls | tool_use | Model wants to call a tool/function | Parse the tool call and invoke the function. Not an error — expected in agent loops. |
| stop_sequence | stop_sequence | Hit a custom stop sequence you defined | This is intentional if you set stop sequences. Otherwise remove the sequence from the request. |
Timeouts: the silent killer
Timeouts are different from HTTP error codes because they often never produce an HTTP response at all — your HTTP client simply gives up waiting. There are three distinct timeout types to configure in an LLM application, and getting any one of them wrong leads to hanging requests or premature failures.
- Connection timeout — how long to wait for the initial TCP connection to the provider. Should be short: 5–10 seconds. A longer wait here usually means the provider is unreachable, not just slow.
- Read/response timeout — how long to wait for the full response body to arrive. This is the one most developers forget to set. LLMs can take 30–120 seconds for large generations. Set this to at least 120 seconds, or use streaming to get tokens incrementally instead of waiting for the full batch.
- Request timeout (total) — a hard ceiling on the entire request lifecycle. Set it high enough that large generations don't get killed prematurely, but not so high that a hung request blocks your thread forever.
Handling a timeout correctly means treating it exactly like a 503: wait, then retry with backoff. Do not immediately blast another request — if the provider is under heavy load, flooding it with retries makes the problem worse. A standard exponential backoff starts at 1 second, doubles on each failure (1s → 2s → 4s → 8s), and adds a small random jitter (±20%) to prevent every client from retrying at the same instant.
import time
import random
import anthropic
client = anthropic.Anthropic(timeout=120.0) # 120s read timeout
def call_with_backoff(messages, max_retries=4):
delay = 1.0
for attempt in range(max_retries):
try:
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=1024,
messages=messages,
)
# Always check stop_reason — 200 OK doesn't mean complete output
if response.stop_reason == "max_tokens":
raise ValueError("Response truncated — raise max_tokens or split the task")
return response
except anthropic.RateLimitError:
if attempt == max_retries - 1:
raise
jitter = random.uniform(0.8, 1.2)
time.sleep(delay * jitter)
delay *= 2
except anthropic.APIStatusError as e:
# 400 / 401 / 403 are not retryable
if e.status_code in (400, 401, 403):
raise
if attempt == max_retries - 1:
raise
jitter = random.uniform(0.8, 1.2)
time.sleep(delay * jitter)
delay *= 2Content refusals and false positives
A content refusal happens when the provider's safety layer intervenes before or during generation. This can happen at two points: the API returns a 400 with an error message like 'The response was filtered due to the prompt triggering content management policy' (pre-generation block), or the model returns 200 OK with finish_reason: content_filter (OpenAI) or stop_reason: refusal (Anthropic Claude 4+).
The Anthropic-specific refusal stop reason was introduced with Claude 4 models. When a streaming classifier detects a potential policy violation mid-generation, the API injects a stop with stop_reason: "refusal" and includes a stop_details object identifying which policy category triggered it. Importantly, Anthropic requires you to reset the conversation context before continuing — the turn that triggered the refusal must be removed or rephrased, because re-sending the same conversation history will trigger another refusal.
False positives — legitimate requests that get blocked — are real and documented. Common triggers include: security-related code examples, medical topics phrased bluntly, legal questions about liability, and creative fiction with dark themes. Strategies to reduce false positives without weakening the model's safety posture:
- Add context in the system prompt. A system message that establishes the use-case ('You are an assistant for a licensed medical practice') can shift the classifier's interpretation of the same user message.
- Rephrase to be less ambiguous. Classifiers are sensitive to phrasing. 'How does buffer overflow exploitation work at a conceptual level?' reads very differently from a more direct phrasing.
- Split the request. If one large prompt covers both sensitive and innocuous content, separating them into smaller calls can prevent the sensitive part from contaminating the classification of the whole.
- Check provider-specific configuration. Enterprise tiers on Azure OpenAI and some Anthropic plans allow adjusting content filter severity levels for specific categories.
Truncated output: the most common silent failure
When a response ends because of a max_tokens limit, you get back a 200 OK with partial text and finish_reason: length (OpenAI) or stop_reason: max_tokens (Anthropic). The output looks valid — it is grammatically plausible text — but it is incomplete. This is a genuine data integrity risk when the output is structured (JSON, code, a list) because incomplete structured output is often worse than no output at all.
The max_tokens (OpenAI) / max_tokens (Anthropic) parameter is a ceiling on output tokens, not a target. If you set it to 256 and the model needs 800 tokens to finish the answer, you get the first 256 tokens and a truncated result. Most providers have a model-specific context window limit that combines input and output tokens — sending a very long conversation leaves little room for output even if your max_tokens is set high.
- Always check the stop reason before parsing structured output. Treat
finish_reason: length/stop_reason: max_tokensas an error that requires action. - Raise max_tokens. The most direct fix. For
gpt-4o, the output limit is up to 16,384 tokens per response; for Claude models it can be up to 8,192 or higher depending on the model and tier. - Split the task. If the output is large by nature (a long document, a full code file), break the task into smaller pieces that each fit comfortably within the limit.
- Use continuation prompting. If truncation is occasional rather than systematic, detect
finish_reason: length, then send a follow-up message like 'Continue from where you left off' and concatenate the outputs. - Reserve output budget. Calculate your input token count before sending. Leave at least the expected output size as headroom. If input + expected output approaches the context window, trim the input or summarise earlier messages.
import openai
client = openai.OpenAI()
def safe_complete(messages, max_tokens=2048):
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
max_tokens=max_tokens,
)
choice = response.choices[0]
if choice.finish_reason == "length":
# Do not silently return truncated output
raise ValueError(
f"Response truncated at {max_tokens} tokens. "
"Raise max_tokens or split the task."
)
if choice.finish_reason == "content_filter":
raise ValueError("Response blocked by content filter. Rephrase the prompt.")
return choice.message.contentGoing deeper
Once you have basic retry logic and stop-reason checks in place, the next layer of reliability comes from circuit breakers, fallback chains, and retry budgets. A circuit breaker tracks recent failure rates and automatically stops sending requests to a provider that is clearly degraded — rather than hammering it with retries and burning quota. Once the failure rate drops below a threshold, the circuit 'closes' and normal traffic resumes.
A fallback chain defines an ordered list of model alternatives. If your primary model (claude-opus-4-5) hits repeated 503 errors, the chain automatically falls back to claude-sonnet-4-5, then to gpt-4o, before giving up. This is especially important for latency-sensitive applications where a provider outage should degrade gracefully rather than take down the feature entirely.
A retry budget prevents a degraded provider from pulling down your whole application. The rule of thumb: total retries should not exceed 10% of total requests at any given moment. If your retry rate climbs above that ceiling, stop retrying and fail fast — a system where 30% of threads are stalled on retry loops will degrade far worse than one that surfaces errors quickly and lets upstream callers handle them.
Streaming and partial-failure handling
Streaming introduces its own class of partial failures. A stream can be interrupted mid-token — the HTTP connection drops, or the provider sends an error event partway through generation. Your stream consumer should accumulate tokens and track whether the stream ended with a stop event or was interrupted. If the stream ends without a clean stop event, treat it as an error and either retry or surface it to the user as incomplete output.
Observability: log everything, alert on the right things
Every LLM call should log: the model name, the HTTP status code, the stop reason, input token count, output token count, and latency. Alert on: finish_reason: length rate above a threshold (indicates systematic token budget issues), 429 error rate trending up (indicates approaching quota), and p95 latency rising sharply (often a leading indicator of provider trouble before 503 errors start). These four metrics catch the vast majority of production LLM reliability issues before they become user-visible.
FAQ
Why does my LLM response get cut off in the middle of a sentence?
Your response was truncated because it hit the max_tokens limit. Check the finish_reason field in the API response — if it says length (OpenAI) or max_tokens (Anthropic), the model ran out of output budget. Raise the max_tokens parameter, split the task into smaller requests, or use a continuation prompt to pick up where the last response stopped.
What is the difference between a 429 error and a 503 error from an LLM provider?
A 429 Too Many Requests means you have exceeded your rate limit — too many requests per minute, too many tokens per minute, or your account quota is exhausted. A 503 Service Unavailable means the provider's servers are overloaded or temporarily down, regardless of your usage. Both are retryable with exponential backoff, but a persistent 429 may require requesting a higher quota tier.
What does finish_reason: content_filter mean and how do I fix it?
finish_reason: content_filter (OpenAI) or stop_reason: refusal (Anthropic Claude 4+) means the model's safety classifier blocked the response. Do not simply retry the same request — it will be blocked again. Try rephrasing the prompt to be less ambiguous, add clarifying context in the system prompt (explaining your use-case), or split the request to isolate the triggering content.
How long should I set the timeout for an LLM API call?
Set your read/response timeout to at least 120 seconds for non-streaming calls, because large generations can legitimately take that long. For streaming calls you can use a shorter per-chunk timeout (10–30 seconds) since tokens arrive incrementally. Keep your connection timeout short (5–10 seconds) — a slow initial connection usually means the endpoint is unreachable, not slow.
Should I retry a 400 Bad Request error from an LLM API?
No. A 400 means the request itself is malformed — invalid JSON, a missing required field, an unrecognised model name, or a message that exceeds the context window. The exact same request will fail every time. Fix the payload (check the error message body for specifics) before sending again. The same logic applies to 401 and 403 errors.
How do I tell if my LLM response is a refusal versus an error?
A refusal typically returns 200 OK with the model generating a polite decline in plain text, or (on Anthropic Claude 4+ streaming) a stop_reason: refusal with no assistant content. A hard error is an HTTP status code other than 2xx, or a 400 with an error body describing a policy violation. The key distinction: refusals are model-layer decisions; errors are infrastructure or request-layer failures.