In plain English
When your app calls a large language model, it usually talks straight to the provider's API (OpenAI, Anthropic, Google, and so on). That works, but it gives you no shared view of what is happening. You can't easily see how much each call costs, you pay again for identical questions over and over, and if the provider has a bad minute your app has a bad minute too.

Cloudflare AI Gateway is a thin proxy you put between your app and those providers. Your code still sends the same request, but instead of pointing at the provider's URL directly, it points at the gateway's URL. The gateway forwards the call, watches everything that passes through, and quietly adds useful features: it logs every request, caches repeated answers, enforces rate limits, retries failed calls, and can fall back to a second model when the first one is down.
Think of it like a smart toll booth on the road between your app and the model. Every car (request) still reaches its destination, but the booth keeps a record of traffic, waves through cars it has seen before without charging again, slows down anyone going too fast, and reroutes traffic when the main road is blocked. You don't rebuild the road — you just route through the booth.
Why it matters
Calling an LLM API directly is fine for a demo. In production, the same three problems show up again and again, and a gateway is the standard place to solve all of them at once.
- No visibility. Without a central point, you can't answer simple questions: how many requests did we make today, which feature is burning the most tokens, what is our average latency, how often do calls fail? A gateway logs every request and response in one dashboard, so cost and usage stop being a mystery.
- Wasted spend. Real apps ask the same things repeatedly — the same FAQ, the same system prompt, the same popular query. Paying the model again for an identical answer is pure waste. A gateway can cache responses and serve the repeat for free and instantly.
- Fragile reliability. Providers have outages, rate-limit you, and time out. If your only path is one provider, their bad day is your outage. A gateway adds retries and fallback to a second model, so a single failure doesn't reach your users.
The reason teams reach for a managed gateway like Cloudflare's, rather than building their own, is that you get all of this with almost no work. You don't run any servers, you don't maintain a caching layer, and the change to your code is usually a single line: swap the base URL your client points at. The gateway runs on Cloudflare's edge network, so it adds very little latency while giving you observability, cost control, and resilience that would otherwise be a whole internal project.
How it works
The core trick is dead simple: change where your request is sent. Instead of calling the provider's endpoint, you call a gateway endpoint that includes your account and gateway name. The gateway forwards your request to the real provider, gets the response, applies its features on the way back, and hands the result to your app. To your code, it still looks like a normal LLM API call.
Because every request now flows through one place, the gateway can layer on a stack of features. Each one is optional and configured per gateway or per request.
The features it adds
Caching is the headline cost-saver. When a request comes in, the gateway computes a cache key — by default a fingerprint of the request (the model, the messages, the key parameters). If it has seen an identical request before and the cached answer hasn't expired, it returns the stored response immediately, without ever calling the provider. That repeat is free and near-instant. You control how long entries live (the time-to-live) and can skip the cache for requests that must always be fresh.
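To make the cache-key idea concrete, here is a minimal in-memory sketch of exact-match caching with a time-to-live. It is an illustration of the technique, not Cloudflare's implementation: `cache_key`, `TTLCache`, `cached_call`, and the `call_provider` callback are all hypothetical names, and the choice of SHA-256 over a sorted JSON payload is an assumption.

```python
import hashlib
import json
import time


def cache_key(model, messages, **params):
    """Fingerprint a request: identical inputs give an identical key."""
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys makes the serialization independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory exact-match cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, response)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return response

    def put(self, key, response):
        self._store[key] = (self.clock() + self.ttl_seconds, response)


def cached_call(cache, call_provider, model, messages, skip_cache=False, **params):
    """Return (response, was_cache_hit); call the provider only on a miss."""
    key = cache_key(model, messages, **params)
    if not skip_cache:
        hit = cache.get(key)
        if hit is not None:
            return hit, True
    response = call_provider(model, messages, **params)
    cache.put(key, response)
    return response, False
```

The `skip_cache` flag mirrors the "must always be fresh" case: the provider is called regardless, and the stored entry is refreshed with the new answer.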
Rate limiting lets you cap how many requests pass through in a time window — useful to protect your budget from a runaway loop or an abusive user. Retries automatically re-send a call that failed or timed out, smoothing over transient provider hiccups. Fallback chains providers: if the first model errors out, the gateway transparently sends the same request to a backup model so the user still gets an answer.
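As a rough picture of what "cap requests in a time window" means, the sketch below implements a fixed-window limiter. The source does not say which windowing strategy the gateway uses, so treat the fixed-window choice and the `FixedWindowRateLimiter` name as assumptions made for illustration.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per `window_seconds` window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._window_start = None
        self._count = 0

    def allow(self):
        now = self.clock()
        # Start a fresh window on first use or once the current one has elapsed.
        if self._window_start is None or now - self._window_start >= self.window_seconds:
            self._window_start = now
            self._count = 0
        if self._count < self.limit:
            self._count += 1
            return True
        return False
```

A caller would check `allow()` before forwarding a request and reject the request (typically with an HTTP 429) when it returns `False`.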
```python
from openai import OpenAI

# Before: talk to the provider directly.
# client = OpenAI(api_key="sk-...")

# After: route through your AI Gateway. Same SDK, same request body;
# only the base_url changes. The gateway now logs, caches, and can
# fall back, with no other code edits.
client = OpenAI(
    api_key="sk-...",
    base_url=(
        "https://gateway.ai.cloudflare.com/v1"
        "/<account_id>/<gateway_name>/openai"
    ),
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize RAG in one line."}],
)
print(resp.choices[0].message.content)
```

That is the whole integration in the simplest case: keep your existing SDK and request, point its base URL at the gateway, and the gateway does the rest. More advanced behavior (custom cache keys, per-route limits, fallback order) is configured in the gateway settings or via request headers, not by rewriting your app.
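For a sense of what "via request headers" can look like, here is a small sketch that builds per-request cache headers. The helper `gateway_cache_headers` is hypothetical, and the header names `cf-aig-cache-ttl` and `cf-aig-skip-cache` are assumptions here; confirm them against Cloudflare's docs before relying on them.

```python
def gateway_cache_headers(ttl_seconds=None, skip_cache=False):
    """Build per-request cache headers (header names are assumptions to verify)."""
    headers = {}
    if skip_cache:
        # Ask the gateway to bypass its cache for this request.
        headers["cf-aig-skip-cache"] = "true"
    elif ttl_seconds is not None:
        if ttl_seconds < 0:
            raise ValueError("ttl_seconds must be non-negative")
        # Ask the gateway to keep the cached answer for this many seconds.
        headers["cf-aig-cache-ttl"] = str(int(ttl_seconds))
    return headers
```

With the OpenAI Python SDK, a dict like this can be passed as `extra_headers=` on an individual `create(...)` call, so the rest of the request stays unchanged.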
What the analytics actually tell you
Observability is the quiet reason most teams adopt a gateway first — even before they turn on caching. Because every call is logged in one place, you get a dashboard that answers the operational questions that are painful to reconstruct from scattered provider bills and app logs.
| You can see | Why it matters |
|---|---|
| Requests over time | Spot traffic spikes, runaway loops, and usage trends per app or feature. |
| Tokens and cost | Attribute spend to the feature or team that caused it, not just one big invoice. |
| Latency | Track end-to-end and time-to-first-token so you know when responses feel slow. |
| Error rate | Catch a provider getting flaky or your requests being rejected, early. |
| Cache hit rate | Measure how much money caching is actually saving you. |
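The metrics in the table are simple aggregates over the request log. The sketch below shows the arithmetic on a hypothetical record shape; `LogRecord` and its field names are invented for illustration and do not reflect the gateway's real log schema.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    """One logged request. Field names are hypothetical, not a real log schema."""
    latency_ms: float
    cost_usd: float  # what the provider charged; 0.0 for a cache hit
    cache_hit: bool
    ok: bool


def summarize(records):
    """Aggregate request count, cost, latency, error rate, and cache hit rate."""
    total = len(records)
    if total == 0:
        return {"requests": 0, "cost_usd": 0.0, "avg_latency_ms": 0.0,
                "error_rate": 0.0, "cache_hit_rate": 0.0}
    return {
        "requests": total,
        "cost_usd": sum(r.cost_usd for r in records),
        "avg_latency_ms": sum(r.latency_ms for r in records) / total,
        "error_rate": sum(1 for r in records if not r.ok) / total,
        "cache_hit_rate": sum(1 for r in records if r.cache_hit) / total,
    }
```

Grouping records by feature or team before calling `summarize` gives the per-feature attribution described in the table.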
Logging full request and response bodies is powerful but sensitive — those payloads can contain user data. Treat the log store as you would any system holding personal information, and use the gateway's controls to limit or scrub what gets stored when needed.
Managed gateway vs rolling your own
You don't need a managed gateway. Teams have long built the same features in-house — a small proxy service with a cache, a logging table, and retry logic. The question is whether that is worth owning. A managed edge gateway and a self-hosted one solve the same problem with very different tradeoffs.
Managed edge gateway:
- No servers to run or scale
- Set up in minutes: swap the base URL
- Caching, limits, logging built in
- Runs at the edge, near users
- Less control over internals

Self-hosted gateway:
- You run and patch the service
- Full control over every behavior
- Build/maintain cache + logs yourself
- Lives in your own network/region
- More work, more flexibility
For most teams shipping an LLM feature, the managed option wins early: it removes an entire infrastructure project and gets you observability and cost control on day one. Teams choose a self-hosted gateway (or an open-source proxy library) when they need data to stay inside their own network, want deep custom routing logic, or already run heavy infrastructure where one more service is no burden. The two are not mutually exclusive; some teams run a self-hosted proxy for routing and point it through a managed gateway for edge caching and analytics.
A gateway also pairs naturally with other cost levers. Its exact-match cache complements prompt caching at the provider and semantic caching, which reuses answers to similar (not just identical) prompts. Together they attack token costs from several angles at once.
Going deeper
Once the basics click, the interesting decisions are about how you tune each feature. The same gateway can be a lightweight logger or a serious reliability layer depending on configuration.
Cache keys are everything. The default key is an exact fingerprint of the request, so two calls only share a cache entry if they are byte-for-byte identical. That means a stray timestamp, a changing user ID in the prompt, or a slightly different temperature all produce cache misses. To raise your hit rate, keep cacheable requests deterministic and consider custom cache keys that ignore fields that shouldn't affect the answer. But beware the opposite failure: a key that is too loose can serve one user's answer to a different question.
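The effect of a volatile field on the key is easy to see in a small sketch. Here `exact_key` hashes the whole request, while `normalized_key` drops fields you have decided should not affect the answer. Both function names and the `IGNORED_FIELDS` list are hypothetical; which fields are safe to ignore is a per-application decision.

```python
import hashlib
import json

# Hypothetical choice: fields we decide should not affect the answer.
IGNORED_FIELDS = frozenset({"user", "request_id", "timestamp"})


def exact_key(request):
    """Hash the full request; any differing field yields a different key."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def normalized_key(request, ignored=IGNORED_FIELDS):
    """Hash the request after dropping top-level fields listed in `ignored`."""
    kept = {k: v for k, v in request.items() if k not in ignored}
    return exact_key(kept)
```

Only top-level fields are dropped here. A user ID or timestamp embedded inside the prompt text would still change the key, so that kind of variation has to be removed before the request is built.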
Exact-match caching has a ceiling. Because it only catches identical requests, it does nothing for the many real cases where users ask the same thing in different words. That is where semantic caching comes in — it embeds the prompt and returns a stored answer when a new prompt is similar enough. Semantic caching catches far more traffic but adds a similarity-threshold risk that exact matching never has. Knowing which one a layer uses tells you what it can and cannot save.
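A toy version of semantic caching shows where the threshold risk comes from. This sketch swaps in a bag-of-words vector (`toy_embed`) as a stand-in for a real embedding model, and `SemanticCache` is a hypothetical class; a production system would use learned embeddings and a vector index.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored answer when a new prompt is similar enough to an old one."""

    def __init__(self, threshold, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self._entries = []  # list of (vector, answer)

    def put(self, prompt, answer):
        self._entries.append((self.embed(prompt), answer))

    def get(self, prompt):
        """Return the best stored answer if its similarity meets the threshold."""
        query = self.embed(prompt)
        best_score, best_answer = 0.0, None
        for vector, answer in self._entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

With this toy embedding, "how do i reset my router" shares five of six words with "how do i reset my password", so a threshold of 0.8 would serve the password answer to the router question, while a threshold of 0.9 would not. That is the kind of false hit exact matching can never produce.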
Fallback needs thought, not just a toggle. A backup model is great for availability, but the backup may have a different price, speed, or quality. Decide whether a fallback that produces a worse answer is acceptable for a given route, and watch your analytics for how often the fallback fires — frequent fallbacks usually mean your primary provider, or your rate limits, need attention rather than a band-aid.
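The retry-then-fallback behavior can be sketched in a few lines. This is an illustration of the pattern under stated assumptions, not the gateway's actual logic: `call_with_fallback`, `AllProvidersFailed`, the exponential backoff schedule, and the `(name, call)` provider pairs are all hypothetical.

```python
import time


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has exhausted its attempts."""


def call_with_fallback(providers, request, max_attempts=2, base_delay=0.5, sleep=time.sleep):
    """Try each (name, call) pair in order, retrying with exponential backoff.

    Returns (provider_name, response) from the first call that succeeds.
    """
    errors = []
    for name, call in providers:
        for attempt in range(max_attempts):
            try:
                return name, call(request)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append((name, attempt + 1, repr(exc)))
                # Back off before retrying the same provider, but not before
                # moving on to the next provider in the chain.
                if attempt + 1 < max_attempts:
                    sleep(base_delay * (2 ** attempt))
    raise AllProvidersFailed(errors)
```

Returning the provider name alongside the response makes it straightforward to count how often the fallback fires, which is the signal worth watching in your analytics.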
Mind latency and time-to-first-token. Adding a hop can add latency, which is why running the gateway at the edge matters. For streaming responses, the metric users feel is time-to-first-token; a well-placed gateway should be near-invisible here, while a cache hit makes the response effectively instant. If you ever see the gateway adding noticeable delay, check whether streaming is being passed through correctly and whether you are routing to a far region.
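If you want to check time-to-first-token yourself, one approach is to time the arrival of the first streamed chunk. The sketch below works on any iterable of text chunks; `measure_stream` is a hypothetical helper, and extracting text from a particular SDK's stream events is left to the caller.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of chunks; return (text, ttft_seconds, total_seconds)."""
    start = clock()
    first_at = None
    parts = []
    for chunk in chunks:
        if first_at is None:
            first_at = clock()  # the first chunk has just arrived
        parts.append(chunk)
    end = clock()
    ttft = None if first_at is None else first_at - start
    return "".join(parts), ttft, end - start
```

Running the same prompt once directly against the provider and once through the gateway, then comparing the two time-to-first-token values, gives a rough estimate of the overhead the extra hop adds.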
Where to go next: read about reducing LLM latency and the broader cost-and-latency toolkit, then read Cloudflare's docs for the exact configuration of cache TTLs, rate-limit rules, and fallback chains. The durable idea, no matter which gateway you pick, is the same: put one well-instrumented door in front of all your model traffic, and cost, reliability, and visibility all become things you can manage instead of guess at.
FAQ
What is Cloudflare AI Gateway used for?
It is a proxy you place between your app and LLM providers to add observability, response caching, rate limiting, retries, and model fallback. The goal is to cut cost and improve reliability for your model calls without running any infrastructure of your own.
How do I connect my app to Cloudflare AI Gateway?
In the simplest case you change a single line: point your existing SDK's base URL at the gateway endpoint (which includes your account ID and gateway name) instead of the provider's URL. The request body and the rest of your code stay the same.
Does AI Gateway support multiple providers like OpenAI and Anthropic?
Yes. It works as a front door for several major providers, and one of its key features is fallback — if your primary model fails, the gateway can route the same request to a backup provider so users still get an answer.
How does caching in AI Gateway save money?
When an identical request comes in again, the gateway returns the stored answer instead of calling the provider, so you don't pay for the repeat and the response is near-instant. By default it caches on an exact match of the request, and you control how long entries live.
Is AI Gateway the same as semantic caching?
No. Its built-in cache is exact-match: it only reuses an answer when the request is identical. Semantic caching reuses answers for prompts that mean the same thing in different words, which catches more traffic but adds a similarity-threshold risk.
Does adding a gateway slow down my LLM calls?
It adds one network hop, which is why running it at the edge — close to users — matters. In practice the overhead is small, and a cache hit actually makes the response far faster because the provider is never called.