Understanding LLM API Errors: Timeouts, Refusals, and Truncated Output

Q: Why does my LLM response get cut off in the middle of a sentence?

Your response was truncated because it hit the `max_tokens` limit. Check the `finish_reason` field in the API response — if it says `length` (OpenAI) or `max_tokens` (Anthropic), the model ran out of output budget. Raise the `max_tokens` parameter, split the task into smaller requests, or use a continuation prompt to pick up where the last response stopped.

Q: What is the difference between a 429 error and a 503 error from an LLM provider?

A `429 Too Many Requests` means you have exceeded your rate limit — too many requests per minute, too many tokens per minute, or your account quota is exhausted. A `503 Service Unavailable` means the provider's servers are overloaded or temporarily down, regardless of your usage. Both are retryable with exponential backoff, but a persistent `429` may require requesting a higher quota tier.

Q: What does finish_reason: content_filter mean and how do I fix it?

`finish_reason: content_filter` (OpenAI) or `stop_reason: refusal` (Anthropic Claude 4+) means the model's safety classifier blocked the response. Do not simply retry the same request — it will be blocked again. Try rephrasing the prompt to be less ambiguous, add clarifying context in the system prompt (explaining your use-case), or split the request to isolate the triggering content.

Q: Should I retry a 400 Bad Request error from an LLM API?

No. A `400` means the request itself is malformed — invalid JSON, a missing required field, an unrecognised model name, or a message that exceeds the context window. The exact same request will fail every time. Fix the payload (check the error message body for specifics) before sending again. The same logic applies to `401` and `403` errors.

Q: How do I tell if my LLM response is a refusal versus an error?

A refusal typically returns `200 OK` with the model generating a polite decline in plain text, or (on Anthropic Claude 4+ streaming) a `stop_reason: refusal` with no assistant content. A hard error is an HTTP status code other than `2xx`, or a `400` with an error body describing a policy violation. The key distinction: refusals are model-layer decisions; errors are infrastructure or request-layer failures.

Recognize every failure mode an LLM call can hit — status codes, stop reasons, refusals, cutoffs — and know the right fix for each.

INTERMEDIATE14 MIN READUPDATED 2026-06-12

In plain English

Every LLM API call can fail in a surprising number of ways — and most of them look identical at first glance: your code gets back something unexpected and the output is wrong, missing, or absent entirely. The good news is that there are really only two categories of failure: the HTTP layer (the request never completed, or completed with an error status code) and the model layer (the request completed fine, but the model stopped early or declined to answer).

Think of it like a restaurant order. The first category is the waiter failing: the kitchen is on fire (503 Service Unavailable), your reservation expired (401 Unauthorized), or you ordered something that isn't on the menu (400 Bad Request). The second category is the chef stopping mid-dish: they ran out of the special ingredient (max_tokens cutoff), the recipe calls for a sauce they won't make (content_filter refusal), or the dish is simply finished (stop / end_turn).

Understanding which category you are in is the first diagnostic step, because the remedies are completely different. A 503 calls for a retry with exponential backoff. A finish_reason: length calls for raising your token budget or splitting the task. A content refusal calls for rephrasing the prompt. Retrying a refused or malformed request is wasted money.

Why it matters

Silent failures are the most dangerous bugs in LLM applications. If your summarizer truncates a legal document mid-sentence and returns 200 OK, nothing throws an exception. The user sees a plausible but incomplete summary. The field finish_reason (OpenAI) or stop_reason (Anthropic) is the only signal that the output was cut off — and most beginner code ignores it entirely.

Rate-limit errors (429) and server errors (503) come up constantly in production. LLM providers throttle requests aggressively to protect capacity. Without a retry strategy, a spike of parallel calls will cause cascading failures across your app. With the right backoff logic, the same spike barely registers as a blip.

Content refusals are a third source of unexpected behaviour. A model will sometimes decline to answer a perfectly reasonable question — a false positive from safety classifiers, a topic that triggers a guardrail, or a prompt phrasing that looks adversarial. Knowing the exact signal for a refusal (versus a truncation or a network timeout) means you can route around it intelligently rather than silently returning an empty response to the user.

Production reliability — unhandled 429 and 503 errors are the most common reason LLM-powered features go down under load.
Data integrity — finish_reason: length is a silent truncation; checking it prevents incomplete structured output from being parsed and acted on.
User experience — distinguishing a transient server error (retry-able) from a content refusal (needs rephrasing) lets you show the right message instead of a generic 'something went wrong.'
Cost control — retrying non-retryable errors (400, 401, 403) burns tokens and money with zero chance of success.

How the error layers work

An LLM API call passes through two distinct checkpoints before you see output. Understanding each checkpoint and its failure modes is the foundation of reliable error handling.

// LLM API call — what can fail where

Your code sends HTTP requestAPI key, model, messages, parametersHTTP layer check401 / 403 / 400 / 429 / 500 / 503 / 504Model generation starts200 OK — request acceptedModel layer checkstop_reason: length / content_filter / refusalResponse returnedCheck finish_reason before trusting output

HTTP status codes — the first checkpoint

Status	Name	Cause	Action
400	Bad Request	Malformed JSON, missing required field, invalid model name, message too long for context window	Fix the request. Do NOT retry — the same request will always fail.
401	Unauthorized	Missing or invalid API key	Check your key. Rotate it if compromised. Do NOT retry.
403	Forbidden	Key is valid but lacks permission for this model or feature	Check your account tier or organisation policy. Do NOT retry.
429	Too Many Requests	Rate limit hit: requests per minute, tokens per minute, or daily quota	Wait and retry with exponential backoff. Respect the Retry-After header.
500	Internal Server Error	Provider-side bug or unexpected model error	Retry with backoff (1–3 attempts). If persistent, check the provider status page.
503	Service Unavailable	Provider overloaded or in maintenance	Retry with backoff. Consider falling back to a different model or provider.
504	Gateway Timeout	Request took longer than the provider's gateway timeout	Retry. For large inputs, try chunking to reduce generation time.

Stop reasons / finish reasons — the second checkpoint

When the HTTP call succeeds (200 OK), the response body includes a field that tells you why the model stopped generating. OpenAI's Chat Completions API calls this finish_reason; Anthropic's Messages API calls it stop_reason. Different names, same concept.

OpenAI finish_reason	Anthropic stop_reason	Meaning	Action
stop	end_turn	Model finished naturally — this is the happy path	None needed. Output is complete.
length	max_tokens	Hit the max token limit before finishing	Raise max_tokens, split the task, or use continuation prompting.
content_filter	refusal (Claude 4+)	Safety classifier blocked the output	Rephrase the prompt, remove context that looks adversarial, or reset conversation.
tool_calls	tool_use	Model wants to call a tool/function	Parse the tool call and invoke the function. Not an error — expected in agent loops.
stop_sequence	stop_sequence	Hit a custom stop sequence you defined	This is intentional if you set stop sequences. Otherwise remove the sequence from the request.

// Retryable vs non-retryable failures

LLM call fails or truncates

Retry with backoff429, 500, 502, 503, 504, timeout

Fix the request400, 401, 403, finish_reason: length

Rephrase the promptcontent_filter, refusal stop_reason

Timeouts: the silent killer

Timeouts are different from HTTP error codes because they often never produce an HTTP response at all — your HTTP client simply gives up waiting. There are three distinct timeout types to configure in an LLM application, and getting any one of them wrong leads to hanging requests or premature failures.

Connection timeout — how long to wait for the initial TCP connection to the provider. Should be short: 5–10 seconds. A longer wait here usually means the provider is unreachable, not just slow.
Read/response timeout — how long to wait for the full response body to arrive. This is the one most developers forget to set. LLMs can take 30–120 seconds for large generations. Set this to at least 120 seconds, or use streaming to get tokens incrementally instead of waiting for the full batch.
Request timeout (total) — a hard ceiling on the entire request lifecycle. Set it high enough that large generations don't get killed prematurely, but not so high that a hung request blocks your thread forever.

Handling a timeout correctly means treating it exactly like a 503: wait, then retry with backoff. Do not immediately blast another request — if the provider is under heavy load, flooding it with retries makes the problem worse. A standard exponential backoff starts at 1 second, doubles on each failure (1s → 2s → 4s → 8s), and adds a small random jitter (±20%) to prevent every client from retrying at the same instant.

pythonpython

import time
import random
import anthropic

client = anthropic.Anthropic(timeout=120.0)  # 120s read timeout

def call_with_backoff(messages, max_retries=4):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-opus-4-5",
                max_tokens=1024,
                messages=messages,
            )
            # Always check stop_reason — 200 OK doesn't mean complete output
            if response.stop_reason == "max_tokens":
                raise ValueError("Response truncated — raise max_tokens or split the task")
            return response
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            jitter = random.uniform(0.8, 1.2)
            time.sleep(delay * jitter)
            delay *= 2
        except anthropic.APIStatusError as e:
            # 400 / 401 / 403 are not retryable
            if e.status_code in (400, 401, 403):
                raise
            if attempt == max_retries - 1:
                raise
            jitter = random.uniform(0.8, 1.2)
            time.sleep(delay * jitter)
            delay *= 2

Content refusals and false positives

A content refusal happens when the provider's safety layer intervenes before or during generation. This can happen at two points: the API returns a 400 with an error message like 'The response was filtered due to the prompt triggering content management policy' (pre-generation block), or the model returns 200 OK with finish_reason: content_filter (OpenAI) or stop_reason: refusal (Anthropic Claude 4+).

The Anthropic-specific refusal stop reason was introduced with Claude 4 models. When a streaming classifier detects a potential policy violation mid-generation, the API injects a stop with stop_reason: "refusal" and includes a stop_details object identifying which policy category triggered it. Importantly, Anthropic requires you to reset the conversation context before continuing — the turn that triggered the refusal must be removed or rephrased, because re-sending the same conversation history will trigger another refusal.

False positives — legitimate requests that get blocked — are real and documented. Common triggers include: security-related code examples, medical topics phrased bluntly, legal questions about liability, and creative fiction with dark themes. Strategies to reduce false positives without weakening the model's safety posture:

Add context in the system prompt. A system message that establishes the use-case ('You are an assistant for a licensed medical practice') can shift the classifier's interpretation of the same user message.
Rephrase to be less ambiguous. Classifiers are sensitive to phrasing. 'How does buffer overflow exploitation work at a conceptual level?' reads very differently from a more direct phrasing.
Split the request. If one large prompt covers both sensitive and innocuous content, separating them into smaller calls can prevent the sensitive part from contaminating the classification of the whole.
Check provider-specific configuration. Enterprise tiers on Azure OpenAI and some Anthropic plans allow adjusting content filter severity levels for specific categories.

Truncated output: the most common silent failure

When a response ends because of a max_tokens limit, you get back a 200 OK with partial text and finish_reason: length (OpenAI) or stop_reason: max_tokens (Anthropic). The output looks valid — it is grammatically plausible text — but it is incomplete. This is a genuine data integrity risk when the output is structured (JSON, code, a list) because incomplete structured output is often worse than no output at all.

The max_tokens (OpenAI) / max_tokens (Anthropic) parameter is a ceiling on output tokens, not a target. If you set it to 256 and the model needs 800 tokens to finish the answer, you get the first 256 tokens and a truncated result. Most providers have a model-specific context window limit that combines input and output tokens — sending a very long conversation leaves little room for output even if your max_tokens is set high.

Always check the stop reason before parsing structured output. Treat finish_reason: length / stop_reason: max_tokens as an error that requires action.
Raise max_tokens. The most direct fix. For gpt-4o, the output limit is up to 16,384 tokens per response; for Claude models it can be up to 8,192 or higher depending on the model and tier.
Split the task. If the output is large by nature (a long document, a full code file), break the task into smaller pieces that each fit comfortably within the limit.
Use continuation prompting. If truncation is occasional rather than systematic, detect finish_reason: length, then send a follow-up message like 'Continue from where you left off' and concatenate the outputs.
Reserve output budget. Calculate your input token count before sending. Leave at least the expected output size as headroom. If input + expected output approaches the context window, trim the input or summarise earlier messages.

pythonpython

import openai

client = openai.OpenAI()

def safe_complete(messages, max_tokens=2048):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=max_tokens,
    )
    choice = response.choices[0]
    if choice.finish_reason == "length":
        # Do not silently return truncated output
        raise ValueError(
            f"Response truncated at {max_tokens} tokens. "
            "Raise max_tokens or split the task."
        )
    if choice.finish_reason == "content_filter":
        raise ValueError("Response blocked by content filter. Rephrase the prompt.")
    return choice.message.content

Going deeper

Once you have basic retry logic and stop-reason checks in place, the next layer of reliability comes from circuit breakers, fallback chains, and retry budgets. A circuit breaker tracks recent failure rates and automatically stops sending requests to a provider that is clearly degraded — rather than hammering it with retries and burning quota. Once the failure rate drops below a threshold, the circuit 'closes' and normal traffic resumes.

A fallback chain defines an ordered list of model alternatives. If your primary model (claude-opus-4-5) hits repeated 503 errors, the chain automatically falls back to claude-sonnet-4-5, then to gpt-4o, before giving up. This is especially important for latency-sensitive applications where a provider outage should degrade gracefully rather than take down the feature entirely.

A retry budget prevents a degraded provider from pulling down your whole application. The rule of thumb: total retries should not exceed 10% of total requests at any given moment. If your retry rate climbs above that ceiling, stop retrying and fail fast — a system where 30% of threads are stalled on retry loops will degrade far worse than one that surfaces errors quickly and lets upstream callers handle them.

Streaming and partial-failure handling

Streaming introduces its own class of partial failures. A stream can be interrupted mid-token — the HTTP connection drops, or the provider sends an error event partway through generation. Your stream consumer should accumulate tokens and track whether the stream ended with a stop event or was interrupted. If the stream ends without a clean stop event, treat it as an error and either retry or surface it to the user as incomplete output.

Observability: log everything, alert on the right things

Every LLM call should log: the model name, the HTTP status code, the stop reason, input token count, output token count, and latency. Alert on: finish_reason: length rate above a threshold (indicates systematic token budget issues), 429 error rate trending up (indicates approaching quota), and p95 latency rising sharply (often a leading indicator of provider trouble before 503 errors start). These four metrics catch the vast majority of production LLM reliability issues before they become user-visible.

FAQ

Why does my LLM response get cut off in the middle of a sentence?

Your response was truncated because it hit the max_tokens limit. Check the finish_reason field in the API response — if it says length (OpenAI) or max_tokens (Anthropic), the model ran out of output budget. Raise the max_tokens parameter, split the task into smaller requests, or use a continuation prompt to pick up where the last response stopped.

What is the difference between a 429 error and a 503 error from an LLM provider?

A 429 Too Many Requests means you have exceeded your rate limit — too many requests per minute, too many tokens per minute, or your account quota is exhausted. A 503 Service Unavailable means the provider's servers are overloaded or temporarily down, regardless of your usage. Both are retryable with exponential backoff, but a persistent 429 may require requesting a higher quota tier.

What does finish_reason: content_filter mean and how do I fix it?

finish_reason: content_filter (OpenAI) or stop_reason: refusal (Anthropic Claude 4+) means the model's safety classifier blocked the response. Do not simply retry the same request — it will be blocked again. Try rephrasing the prompt to be less ambiguous, add clarifying context in the system prompt (explaining your use-case), or split the request to isolate the triggering content.

How long should I set the timeout for an LLM API call?

Set your read/response timeout to at least 120 seconds for non-streaming calls, because large generations can legitimately take that long. For streaming calls you can use a shorter per-chunk timeout (10–30 seconds) since tokens arrive incrementally. Keep your connection timeout short (5–10 seconds) — a slow initial connection usually means the endpoint is unreachable, not slow.

Should I retry a 400 Bad Request error from an LLM API?

No. A 400 means the request itself is malformed — invalid JSON, a missing required field, an unrecognised model name, or a message that exceeds the context window. The exact same request will fail every time. Fix the payload (check the error message body for specifics) before sending again. The same logic applies to 401 and 403 errors.

How do I tell if my LLM response is a refusal versus an error?

A refusal typically returns 200 OK with the model generating a polite decline in plain text, or (on Anthropic Claude 4+ streaming) a stop_reason: refusal with no assistant content. A hard error is an HTTP status code other than 2xx, or a 400 with an error body describing a policy violation. The key distinction: refusals are model-layer decisions; errors are infrastructure or request-layer failures.

// In plain English

// Why it matters

// How the error layers work

HTTP status codes — the first checkpoint

Stop reasons / finish reasons — the second checkpoint

// Timeouts: the silent killer

// Content refusals and false positives

// Truncated output: the most common silent failure

// Going deeper

Streaming and partial-failure handling

Observability: log everything, alert on the right things

// FAQ

// Further reading

// Related

In plain English

Why it matters

How the error layers work

Timeouts: the silent killer

Content refusals and false positives

Truncated output: the most common silent failure

Going deeper

FAQ

Further reading

Related