What Is LiteLLM? One API for 100+ LLMs

After reading, you'll understand what LiteLLM is, how it normalizes 100+ models behind one OpenAI-format API, and how its budgets and fallbacks cut cost and risk.

Intermediate | 10 min read | Updated 2026-06-14

Docs: docs.litellm.ai | GitHub: BerriAI/litellm (50.9k)

In plain English

Every LLM provider speaks a slightly different dialect. OpenAI wants your request shaped one way, Anthropic's Claude another, Google's Gemini a third. The fields differ, the way you stream tokens differs, even the way errors come back differs. If your app talks to three providers, you end up writing and maintaining three different clients — and a fourth model means a fourth integration.


LiteLLM is the universal adapter that makes all of them speak one language. It's an open-source tool that puts a single, OpenAI-format API in front of 100+ models from many providers. You write your code once, in the format you already know, and LiteLLM translates each call into whatever the target provider expects — then translates the response back into a single, predictable shape.

Think of it like a travel power adapter. Your laptop charger has one plug; every country has a different socket. Instead of buying a new charger for each country, you carry one adapter that fits them all. LiteLLM is that adapter for language models: your one "plug" (OpenAI-format code) fits every "socket" (provider) behind it.

Why it matters

A gateway like LiteLLM solves a cluster of problems that show up the moment an LLM app grows past a single model and a single developer.

Avoid lock-in. If your whole codebase is wired to one provider's exact SDK, switching models means a rewrite. Behind a unified API, swapping gpt-4o for claude-sonnet or a local model is a one-line config change, not a refactor.
Use the best model per task. Cheap model for classification, a strong reasoning model for hard questions, a vision model for images — all through one interface, instead of juggling three SDKs in the same service.
Control spend. Self-serve LLM access inside a company quietly turns into a surprise bill. A gateway can attach budgets and spend limits per team, per key, or per model and cut off requests that blow the cap.
Stay up when a provider wobbles. Providers have outages and rate limits. A gateway can retry and fall back to another model automatically, so one provider's bad afternoon doesn't take your app down.
See everything in one place. Because every call flows through it, the gateway is the natural spot for logging, cost tracking, and rate limiting — one dashboard instead of N provider consoles.

This is why a gateway is considered a core layer of an LLMOps stack — the operational plumbing around your models. It sits between your application code and the providers, and it owns the cross-cutting concerns (cost, reliability, observability, access) that you'd otherwise re-implement in every service.

How it works

At its core LiteLLM does one job: it accepts a request in the OpenAI Chat Completions format, looks at which model you asked for, rewrites the request into that provider's native schema, sends it, and maps the provider's reply back into the OpenAI response shape. Streaming, function/tool calls, and errors all get normalized the same way, so your code only ever deals with one format.

A request through the gateway:

Your app (OpenAI-format request) → Translate (to provider schema) → Route (pick a deployment) → Provider (OpenAI / Anthropic / …) → Normalize (back to OpenAI shape)
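The translate and normalize steps can be sketched in plain Python. This is a simplified illustration, not LiteLLM's actual implementation. The helper names `to_anthropic_request` and `from_anthropic_response` are hypothetical, the default `max_tokens` value is an assumption, and the field mapping only covers the general shape of Anthropic's Messages API (a top-level `system` field, a required `max_tokens`, and a list of content blocks in the reply).

```python
def to_anthropic_request(openai_req: dict) -> dict:
    """Rewrite an OpenAI-format chat request into an Anthropic-style request (simplified)."""
    messages = openai_req["messages"]
    # Anthropic takes the system prompt as a top-level field, not as a message.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    req = {
        "model": openai_req["model"].split("/", 1)[-1],  # drop the "anthropic/" prefix
        "messages": chat,
        "max_tokens": openai_req.get("max_tokens", 1024),  # assumed default for this sketch
    }
    if system_parts:
        req["system"] = "\n".join(system_parts)
    return req


def from_anthropic_response(resp: dict, model: str) -> dict:
    """Map an Anthropic-style reply back into the OpenAI response shape (simplified)."""
    text = "".join(block["text"] for block in resp["content"] if block["type"] == "text")
    usage = resp.get("usage", {})
    prompt_tokens = usage.get("input_tokens", 0)
    completion_tokens = usage.get("output_tokens", 0)
    # Only two stop reasons are mapped here; anything else falls back to "stop".
    finish_reason = {"end_turn": "stop", "max_tokens": "length"}.get(resp.get("stop_reason"), "stop")
    return {
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": finish_reason,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

The caller only ever touches the OpenAI-shaped dictionaries on either side; everything provider-specific stays inside the two helpers. A real gateway also has to map streaming chunks, tool calls, and error types, which this sketch leaves out.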

The SDK: translation inside your process

The simplest form is the Python SDK. You call one function, pass a model string like "anthropic/claude-sonnet-4-5" or "gemini/gemini-2.5-pro", and LiteLLM handles the rest in-process. No extra service to run — it's just a library.

Same call, any provider (Python):

from litellm import completion

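# Provider API keys are read from environment variables
# (for example OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY).
# Switch providers by changing only the model string.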
for model in ["openai/gpt-4o", "anthropic/claude-sonnet-4-5", "gemini/gemini-2.5-pro"]:
    resp = completion(
        model=model,
        messages=[{"role": "user", "content": "Say hi in five words."}],
    )
    # Response is always in OpenAI format, whatever the provider.
    print(model, "->", resp.choices[0].message.content)

The proxy: one shared gateway for everyone

The proxy server is the SDK wrapped as a standalone service with its own OpenAI-compatible endpoint. Your apps — in any language — point their OpenAI client at the proxy's URL instead of api.openai.com. The proxy holds the real provider keys, so individual apps never see them. This is where the operations features live: virtual keys, budgets, rate limits, fallbacks, and centralized logging.
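From the application's side, calling the proxy is an ordinary OpenAI-format HTTP request aimed at a different host. The sketch below uses only the standard library so the request shape is visible. The proxy address `http://localhost:4000`, the `/v1/chat/completions` path, and the virtual key `sk-team-a-example` are assumptions for illustration; substitute whatever your deployment actually uses.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000"   # assumption: where your proxy listens
VIRTUAL_KEY = "sk-team-a-example"     # hypothetical virtual key issued by the gateway


def build_chat_request(model: str, prompt: str,
                       base_url: str = PROXY_URL,
                       api_key: str = VIRTUAL_KEY) -> urllib.request.Request:
    """Build an OpenAI-format chat request aimed at the proxy instead of a provider."""
    body = {
        "model": model,  # an alias from the proxy's model list, not a provider model name
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The app presents only the virtual key; real provider keys stay on the proxy.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and return the parsed OpenAI-format response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with a proxy running at PROXY_URL:
#   reply = send(build_chat_request("smart", "Say hi in five words."))
#   print(reply["choices"][0]["message"]["content"])
```

In practice most teams would point an existing OpenAI client library at the proxy URL rather than hand-build requests; the point here is that nothing provider-specific appears in the app.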

config.yaml: a model list with a fallback (YAML)

model_list:
  - model_name: smart            # the alias your apps request
    litellm_params:
      model: anthropic/claude-sonnet-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: smart-backup
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  fallbacks: [{ "smart": ["smart-backup"] }]   # if 'smart' fails, try the backup

A few concepts make the proxy click:

Model list. A config file that maps friendly aliases (like smart) to real provider deployments. Apps ask for the alias; the gateway decides what actually runs behind it. Point an alias at a new model and every app upgrades at once.
Virtual keys. Instead of handing out raw provider keys, you issue per-team or per-app keys at the gateway. Each can carry its own budget, rate limit, and allowed-model list — and you can revoke one without touching the others.
Fallback chains. An ordered list of backups. If the primary errors out or hits a rate limit, the router transparently retries the next model in the chain, so the caller still gets an answer.
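The fallback-chain idea is small enough to sketch directly. This is a minimal illustration of the pattern, not LiteLLM's router: `complete_with_fallbacks`, `AllModelsFailed`, and the `call_model` callable are hypothetical names, and the sketch treats every exception as retryable, which a real gateway would not.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("gateway.fallback")


class AllModelsFailed(Exception):
    """Raised when every model in the chain has failed."""

    def __init__(self, errors):
        super().__init__(f"all {len(errors)} models in the chain failed")
        self.errors = errors  # list of (model, exception) pairs, in attempt order


def complete_with_fallbacks(call_model: Callable[[str], str],
                            chain: Sequence[str]) -> tuple[str, str]:
    """Try each model in order; return (model_that_answered, result)."""
    errors = []
    for position, model in enumerate(chain):
        try:
            result = call_model(model)
        except Exception as exc:  # simplification: a real router retries only some errors
            errors.append((model, exc))
            logger.warning("model %s failed (%s); trying next in chain", model, exc)
            continue
        if position > 0:
            # Make the fallback visible so it can be counted and alerted on.
            logger.warning("fallback fired: served by %s instead of %s", model, chain[0])
        return model, result
    raise AllModelsFailed(errors)
```

Returning the name of the model that actually answered, and logging whenever it is not the primary, keeps the fallback transparent to the caller without hiding it from operators.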

SDK vs proxy: which form do you need?

The single most common point of confusion is treating "LiteLLM" as one thing. It's two deployment shapes over one translation core. Here's how they differ in practice.

Aspect	Python SDK	Proxy server (gateway)
What it is	A library you `import`	A standalone service with a URL
Runs	Inside your app process	As its own deployment
Languages	Python only	Any language (OpenAI-compatible HTTP)
Holds provider keys	Your app does	The proxy does (apps never see them)
Budgets & virtual keys	Limited: nothing shared across apps	Yes; its main reason to exist
Central logging & limits	Per app	Once, for all callers
Best for	A single service, quick start	Many teams/apps sharing models

Both give you the headline win — one OpenAI-format call reaching 100+ models. The proxy adds the shared governance layer on top. Many teams begin with the SDK in a prototype, then move the same logic behind the proxy once a second app needs the same models.

Self-hosted LiteLLM vs a hosted gateway

LiteLLM is something you run. A hosted gateway like OpenRouter is something you sign up for. They solve the same surface problem — one API in front of many models — but the trade-offs are different, and the choice is mostly about who owns the infrastructure and the billing.

Two ways to get one API for many models:

LiteLLM (self-hosted)

You run the SDK or proxy
You bring your own provider keys
Separate bill per provider
Full control of data and routing
Open-source, no per-call markup

OpenRouter (hosted)

Managed — zero infrastructure
One key, one account
One consolidated bill
Routing handled for you
Convenience for a small fee/margin

Pick self-hosted LiteLLM when you want your own keys, your own logs, data kept inside your network, and no middle-man margin — at the cost of running a service. Pick a hosted gateway when you want to ship today with one signup and no servers to operate. The two aren't mutually exclusive: a LiteLLM proxy can even route to a hosted gateway as just another provider behind it.

Common pitfalls

Assuming every model supports every feature. A unified API normalizes the shape of a request, not each model's capabilities. Ask a model with no vision or tool-calling support to do those things and you'll still get an error — translation can't add a feature the provider lacks.
Leaking provider keys past the proxy. The proxy's whole security value is that apps hold virtual keys and only the gateway holds the real ones. If apps still ship raw provider keys, you've kept the maintenance and lost the protection.
Fallbacks that hide real failures. Automatic fallback is great for outages, but if you fall back silently on every error you can mask a broken prompt or a misconfigured model. Log when a fallback fires and alert on the rate.
Forgetting it's another hop. The proxy adds a small amount of latency and one more service to keep alive. Run it close to your apps, give it sane timeouts, and treat its uptime as part of your system's uptime.
Budgets set and forgotten. Spend limits only help if someone watches them. A team that quietly hits its cap will see failed requests, not a warning — wire the budget data into your normal alerting.
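One way to avoid the silent-cap problem is to give each key a warning state before the hard stop. The sketch below is a hypothetical in-memory model of that idea, not LiteLLM's budget implementation: `KeyBudget`, `BudgetExceeded`, and the 80% `warn_ratio` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a key tries to spend after reaching its cap."""


@dataclass
class KeyBudget:
    key_id: str
    max_budget_usd: float
    spent_usd: float = 0.0
    warn_ratio: float = 0.8  # hypothetical threshold: warn at 80% of the cap

    def status(self) -> str:
        """Return "ok", "warn", or "blocked" based on spend so far."""
        if self.spent_usd >= self.max_budget_usd:
            return "blocked"
        if self.spent_usd >= self.warn_ratio * self.max_budget_usd:
            return "warn"
        return "ok"

    def record(self, cost_usd: float) -> str:
        """Add the cost of a completed request and return the new status."""
        if self.status() == "blocked":
            raise BudgetExceeded(
                f"{self.key_id} has used its ${self.max_budget_usd:.2f} budget"
            )
        self.spent_usd += cost_usd
        return self.status()
```

Feeding the "warn" status into existing alerting gives a team time to react before requests start failing.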

Going deeper

Once the basics click, the interesting work is in the router and the operations layer around it. A few directions worth knowing.

Smarter routing. Beyond a fixed fallback chain, a gateway can load-balance across several deployments of the same model (say, two regions or two API keys) and route by latency, least-busy, or cost. This is how teams squeeze more throughput out of provider rate limits without changing app code, and it pairs naturally with work on reducing LLM latency and time to first token.
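As a rough illustration of least-busy routing, the sketch below tracks in-flight requests per deployment and always picks the one with the fewest. The `Deployment` and `LeastBusyRouter` classes are hypothetical and single-threaded; a real router would also need locking, health checks, and cooldowns.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str          # e.g. a region or an API-key label for the same model
    in_flight: int = 0  # requests currently being served


class LeastBusyRouter:
    """Pick the deployment with the fewest in-flight requests."""

    def __init__(self, deployments):
        self.deployments = list(deployments)

    def acquire(self) -> Deployment:
        # min() returns the first deployment on ties, so list order breaks ties.
        choice = min(self.deployments, key=lambda d: d.in_flight)
        choice.in_flight += 1
        return choice

    def release(self, deployment: Deployment) -> None:
        # Call this when the request finishes, whether it succeeded or failed.
        deployment.in_flight -= 1
```

A caller would wrap each request in `acquire()` and `release()` (ideally in a `try`/`finally`), so the counts stay accurate even when a provider call raises.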

Caching at the gateway. Because every request passes through one place, the gateway is a natural home for a response cache. Exact-match caching is cheap and safe; semantic caching reuses answers for similar (not identical) questions. Both are part of the broader effort to cut token costs, and it helps to understand the difference between prompt caching and semantic caching.
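Exact-match caching comes down to hashing the full request and reusing the stored response when the hash repeats. The sketch below shows that idea with an in-memory dictionary; `cache_key` and `ExactMatchCache` are hypothetical names, and a production cache would add expiry and a shared store.

```python
import hashlib
import json


def cache_key(model: str, messages: list, **params) -> str:
    """Hash everything that can change the answer: model, messages, and parameters."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,  # stable key regardless of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """In-memory response cache keyed on the exact request."""

    def __init__(self):
        self._store = {}

    def get_or_call(self, call, model: str, messages: list, **params):
        """Return (response, was_cache_hit); call the model only on a miss."""
        key = cache_key(model, messages, **params)
        if key in self._store:
            return self._store[key], True
        response = call(model, messages, **params)
        self._store[key] = response
        return response, False
```

Any change to the prompt, the model alias, or a parameter such as temperature produces a different key, which is what makes this form of caching predictable. Semantic caching relaxes that guarantee by matching on similarity instead.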

Observability and governance. Production deployments wire the proxy into logging and cost-tracking backends, tag spend by team and key, and enforce per-key rate limits. The model list plus virtual keys effectively becomes your org's control plane for LLM access: one place to add a model, set a budget, or revoke access.

Where it stops. A gateway normalizes interfaces and manages traffic; it does not make a weak model strong or fix a bad prompt. It also can't fully erase provider differences in behavior — two models given the identical normalized request will still answer differently. Treat LiteLLM as the reliability-and-cost layer of your stack, and keep evaluating model quality separately. The durable idea: put one stable seam between your app and the churning model market, so swapping models stays a config change, not a rewrite.

FAQ

What is LiteLLM used for?

LiteLLM gives you one OpenAI-format API in front of 100+ models from many providers, so you can call any of them with the same code. Teams use it to avoid provider lock-in, switch models with a config change, and — through its proxy server — add budgets, virtual keys, retries, and central logging.

What is the difference between the LiteLLM SDK and the LiteLLM proxy?

The SDK is a Python library you import into your app; translation happens in-process and your app holds the provider keys. The proxy is a standalone service with its own OpenAI-compatible URL that any language can call — it holds the real keys and adds shared budgets, virtual keys, rate limits, and logging. Same engine, two deployment shapes.

LiteLLM vs OpenRouter — which should I use?

They solve the same problem (one API for many models) but differ on ownership. LiteLLM is self-hosted: you run it, bring your own provider keys, and get one bill per provider with full control. OpenRouter is hosted: one signup, one key, one consolidated bill, and no servers to run. Choose self-hosting for control and your own keys, hosting for speed and zero infrastructure.

Does LiteLLM cost money?

LiteLLM itself is open-source and free to self-host; you only pay the underlying model providers for tokens. There is a commercial enterprise tier with extra features and support, but the core SDK and proxy you can run yourself at no license cost.

Do I still pay each provider when I use LiteLLM?

Yes. Self-hosted LiteLLM uses your own provider API keys, so each provider bills you directly for the tokens you spend through it. Unlike a hosted gateway that consolidates billing, LiteLLM doesn't sit in the payment path — it just routes and translates the calls.

Can LiteLLM automatically fall back to another model if one fails?

Yes. You define fallback chains in the config — an ordered list of backup models. If the primary errors out or hits a rate limit, the router retries the next model in the chain transparently, so the caller still gets a response. Just remember to log when a fallback fires so real failures don't stay hidden.
