In plain English
An LLM API is the doorway your code uses to talk to an AI model that lives on someone else's servers. API stands for Application Programming Interface — a fixed, documented way for one program to ask another program to do something. An LLM API is that idea pointed at a large language model: your program sends some text, and the provider's giant model sends text back.
The everyday analogy is ordering food through a delivery app. You don't walk into the kitchen, buy a stove, or learn to cook. You fill in a structured form (this dish, this address, hold the onions), tap send, and a meal shows up. You never see the kitchen. An LLM API is the same deal: you fill in a structured request (this model, this message, keep the answer short), send it over the internet, and a finished response shows up. The model itself (hundreds of gigabytes of weights, running on expensive GPUs) stays in the provider's 'kitchen.'
Concretely, talking to an LLM API means sending an HTTP request — the same kind of request your browser makes to load a web page — to a specific web address (a URL endpoint) run by a provider like Anthropic, OpenAI, or Google. You attach your message and a secret key that proves who you are. Seconds later you get back a response containing the model's reply. That round trip is the whole game. Everything else — chatbots, coding assistants, AI agents — is built on top of it.
Why it matters
Frontier LLMs are enormous. Running one means owning or renting racks of high-end GPUs, loading hundreds of gigabytes of model weights, and keeping it all online. That is out of reach for almost every app and every individual. The API solves this: the provider hosts the model once, and millions of developers borrow it through a simple web request. You go from 'I need a data center' to 'I need an internet connection and a key.'
That shift is why AI features showed up everywhere so fast. A small team can add summarization, chat, or classification to their product in an afternoon by calling an API — no machine learning expertise, no servers, no training. The model becomes a building block you call like a payment processor or a maps service. You pay per use instead of per server, so a weekend project and a company serving millions hit the exact same endpoint.
Who should care:
- Developers: the API is now a standard tool in the box, as routine as calling a database. Most things labeled 'AI app' are an LLM API call wrapped in product code.
- Beginners: this is the single highest-leverage skill to learn first. One working API call unlocks chatbots, retrieval-augmented generation (RAG), and agents, because they are all elaborations on the same request-and-reply.
- Anyone evaluating cost: usage is metered by tokens, so understanding the API is the first step to understanding the bill (see how LLM API pricing works).
What it replaced: before hosted LLM APIs, adding language intelligence to software meant training your own narrow model — collecting data, hiring specialists, waiting months — for each separate task. The API collapsed that into a network call to a single general-purpose model that handles all of those tasks through plain instructions.
How it works
Every LLM API call is one HTTP round trip. Your program builds a request and sends it across the internet to the provider's endpoint; the provider runs the model and sends back a response.
Three pieces make up the request you send. Get these and you understand 90% of every LLM API.
1. The API key
A long secret string (looks like sk-...) that proves the request is yours and tells the provider whose account to bill. You create one in the provider's dashboard and send it in an HTTP header on every request. It is a password — anyone who has it can spend your money — so it lives in an environment variable or secret manager, never hardcoded in your app or committed to git.
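To make the "environment variable, never hardcoded" advice concrete, here is a minimal sketch of loading the key and building the request headers. The header names are the ones used in the curl example later in this article; the helper names `build_headers` and `mask` are made up for illustration.

```python
import os


def build_headers(env=os.environ):
    """Build request headers, reading the key from the environment."""
    key = env.get("ANTHROPIC_API_KEY")
    if not key:
        # Fail loudly instead of sending an unauthenticated request.
        raise RuntimeError("Set the ANTHROPIC_API_KEY environment variable first")
    return {
        "x-api-key": key,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }


def mask(key):
    """Return a log-safe version of a key: the first 3 characters only."""
    return key[:3] + "..."
```

If you ever need to log which key a request used, log `mask(key)` rather than the key itself.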
2. The messages and roles
Modern LLM APIs are chat-shaped. Instead of one blob of text, you send a list of messages, each tagged with a role that tells the model who is speaking. The three roles you'll meet immediately:
| Role | Who it is | What it carries |
|---|---|---|
| system | You, the developer | Standing instructions: persona, rules, format. Often sent as a separate field. |
| user | The end user | The actual question or request. |
| assistant | The model | Replies, including past turns you send back to give the model memory. |
This list is also how a chatbot remembers the conversation. The model itself is stateless — it forgets everything the instant a call ends. To continue a chat, your code resends the whole history (every prior user and assistant message) on the next call. The growing transcript has to fit inside the model's context window, the maximum amount of text it can read at once.
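Here is a rough sketch of that "resend the whole history" loop. `Conversation` and `fake_model` are hypothetical names; `fake_model` stands in for a real API call so the example runs offline and simply reports how many messages it was sent.

```python
class Conversation:
    """Keeps the transcript client-side, because the API keeps nothing between calls."""

    def __init__(self, system=None):
        self.system = system
        self.messages = []

    def ask(self, user_text, call_model):
        # Add the new user turn, then send the ENTIRE history, not just this turn.
        self.messages.append({"role": "user", "content": user_text})
        reply = call_model(self.system, list(self.messages))
        # Store the reply so the next call includes it too.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_model(system, messages):
    """Stand-in for a real API call: reports how many messages it received."""
    return f"I can see {len(messages)} message(s)."


chat = Conversation(system="You are a concise teacher.")
first = chat.ask("Hi", fake_model)                     # model receives 1 message
second = chat.ask("What did I just say?", fake_model)  # model receives 3 messages
```

The second call carries three messages (user, assistant, user), which is the only reason the model could answer a question about the first turn.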
3. The parameters (and tokens)
Alongside the messages you set a few knobs. The two you'll use constantly: model picks which model answers (a smaller, cheaper model or a larger, smarter one), and max_tokens caps how long the reply can get. A token is a chunk of text — roughly ¾ of a word — and it is the unit the whole system runs on: the model reads tokens, writes tokens, and you are billed per token. The classic optional knob is temperature, which controls randomness — low for precise, repeatable answers, higher for creative variety (temperature explained).
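A back-of-the-envelope sketch of how tokens turn into money. The word-based estimate applies the ¾-of-a-word rule of thumb and is only a rough guide for English text (real tokenizers differ), and the prices in the usage line are made-up placeholders, not any provider's actual rates.

```python
def estimate_tokens(text):
    """Rough token estimate from the ~3/4-word-per-token rule of thumb."""
    words = len(text.split())
    return round(words / 0.75)


def estimate_cost(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok):
    """Dollar cost given prices quoted per million tokens."""
    total = input_tokens * in_price_per_mtok + output_tokens * out_price_per_mtok
    return total / 1_000_000


prompt_tokens = estimate_tokens("Explain an LLM API in one sentence.")  # 7 words
# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost = estimate_cost(1000, 200, in_price_per_mtok=3.0, out_price_per_mtok=15.0)
```

For exact numbers, rely on the usage counts the API returns rather than an estimate like this.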
The response comes back as JSON — a structured text format every language can read. Buried in it are two things you care about: the model's reply text, and a usage block reporting how many input and output tokens the call consumed. That usage count is exactly what you're billed on.
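As a sketch of what "buried in it" means, the snippet below pulls the reply text and token counts out of a response body. The JSON is a hand-written sample modeled on the shape of Anthropic's Messages API responses (a list of content blocks plus a usage object); the id, text, and token numbers are invented for illustration.

```python
import json

# Hand-written sample body; the values are illustrative only.
raw = """
{
  "id": "msg_example",
  "type": "message",
  "role": "assistant",
  "content": [
    {"type": "text", "text": "An LLM API lets your code send text to a hosted model and get text back."}
  ],
  "stop_reason": "end_turn",
  "usage": {"input_tokens": 15, "output_tokens": 18}
}
"""


def parse_reply(raw_json):
    """Return (reply_text, input_tokens, output_tokens) from a response body."""
    data = json.loads(raw_json)
    # The reply is a list of blocks; join the text ones.
    text = "".join(b["text"] for b in data["content"] if b["type"] == "text")
    usage = data["usage"]
    return text, usage["input_tokens"], usage["output_tokens"]


reply_text, tokens_in, tokens_out = parse_reply(raw)
```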
Your first API call
Let's make the round trip real. The rawest possible version is curl, a command-line tool that fires an HTTP request straight from your terminal — no code, no SDK. This calls Anthropic's Messages API, the endpoint behind Claude. Notice the three pieces from above: the key in a header, the messages list, and the parameters.
```bash
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "max_tokens": 200,
    "messages": [
      {"role": "user", "content": "Explain an LLM API in one sentence."}
    ]
  }'
```

That's the whole protocol. But almost nobody hand-builds requests in production: every major provider ships an SDK (a software library) that wraps the HTTP details so you call the model like a normal function. Same request, far less ceremony:
```python
# pip install anthropic
import os

from anthropic import Anthropic

# Reads the key from an environment variable; never hardcode it
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=200,
    system="You are a concise teacher.",  # standing instructions
    messages=[
        {"role": "user", "content": "Explain an LLM API in one sentence."}
    ],
)

print(response.content[0].text)  # the reply
print(response.usage)            # input/output token counts
```

The SDK turns the model into one function call. Swap to a different provider (OpenAI's openai package, Google's google-genai) and the shape barely changes: a client built with a key, a list of role-tagged messages, a model name, and a token cap. Learn the pattern once and you can call almost any LLM. For a deeper, Claude-specific walkthrough, see the Claude API beginner's guide.
Chat Completions, Messages, and Responses
You'll hear several endpoint names thrown around. They are different providers' takes on the same request-and-reply, so don't let the names intimidate you.
- Chat Completions: the original chat-style endpoint that OpenAI popularized (/v1/chat/completions). Its message-list shape became a de facto standard that many other providers and open-source tools copied, so it's the most widely recognized name.
- Messages: Anthropic's endpoint for Claude (/v1/messages). Same idea, a few naming differences (for example, the system prompt is its own top-level field rather than a message in the list).
- Responses: OpenAI's newer endpoint built for agents and tool use, folding multi-step interactions into one API. The classic Chat Completions style still works alongside it.
What matters for a beginner: under every one of these names is the same skeleton — send role-tagged messages plus parameters to a URL with your key, get JSON back. Once you can read one provider's docs, the rest are variations on a theme. Pick one, build something small, then port it. The mental model transfers cleanly.
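To show how small the porting step can be, here is a sketch that converts a Chat Completions-style message list (system prompt inside the list) into the Messages-style shape (system prompt as its own field). The function name `to_messages_style` is made up, and real ports also have to map parameter names and response fields, which this ignores.

```python
def to_messages_style(chat_messages):
    """Split a Chat Completions-style list into (system, messages)."""
    system_parts = [m["content"] for m in chat_messages if m["role"] == "system"]
    rest = [m for m in chat_messages if m["role"] != "system"]
    # Join multiple system messages; use None when there are none.
    system = "\n".join(system_parts) if system_parts else None
    return system, rest


chat_style = [
    {"role": "system", "content": "You are a concise teacher."},
    {"role": "user", "content": "Explain an LLM API in one sentence."},
]
system, messages = to_messages_style(chat_style)
```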
Common pitfalls
A handful of mistakes trip up nearly everyone on their first project. Knowing them up front saves hours.
- Leaking the API key. Pasting a key into client-side code, a public repo, or a screenshot exposes it to the world — and bills land on you. Keep keys server-side, in environment variables. Rotate any key you suspect leaked.
- Expecting the model to remember. The API is stateless. If your bot 'forgets' the last message, it's because you didn't resend the conversation history. Memory is your job, by re-including past messages each call.
- Ignoring errors and rate limits. The network can fail, and providers cap how many requests or tokens you can send per minute (a rate limit, often HTTP 429). Real code retries transient failures with a short backoff instead of crashing.
- Forgetting outputs cost money too. Both the text you send and the text the model generates are billed. Long replies across millions of calls add up fast, so set max_tokens no higher than you need.
- Treating replies as guaranteed-valid data. The model returns text, and text can be wrong, off-format, or hallucinated. Validate anything you feed into downstream code.
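The retry advice above can be sketched as a small wrapper with exponential backoff. `TransientError` is a hypothetical stand-in for whatever your HTTP client or SDK raises on a 429 or a dropped connection, and the delays are illustrative defaults rather than provider recommendations. Official SDKs often ship their own retry settings, so check those before writing your own.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a 429 response or a dropped connection."""


def with_retries(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 0.5s, 1s, 2s, ... plus a little jitter so clients don't retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

The `sleep` parameter exists so tests can pass a fake and avoid real waiting.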
Going deeper
The basic call is the floor. Real production use leans on four capabilities layered on the same endpoint.
Streaming flips one flag so the response arrives token-by-token instead of all at once — that's the 'typewriter' effect in chat UIs, and it slashes the time the user waits for the first word. Under the hood it uses Server-Sent Events over the same HTTP connection (streaming explained). Function calling (a.k.a. tool use) lets you describe tools your code provides; the model can then ask to call one — look up this order, run this query — and you run it and hand back the result. It's the foundation of every agent (what is function calling).
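The "you run it and hand back the result" half of function calling might look roughly like this. The block shapes (`tool_use` requests with an `id`, `name`, and `input`, answered by `tool_result` blocks) are modeled on Anthropic's tool use format as an assumption; other providers use different field names. `get_order_status` is a hypothetical tool that returns canned data.

```python
import json


def get_order_status(order_id):
    """Hypothetical tool: a real app would query a database here."""
    return {"order_id": order_id, "status": "shipped"}


# Only tools registered here can ever be run on the model's behalf.
TOOLS = {"get_order_status": get_order_status}


def run_tool_requests(content_blocks):
    """Run each tool the model asked for and build the result blocks to send back."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # plain text blocks need no action
        func = TOOLS.get(block["name"])
        if func is None:
            output, is_error = f"Unknown tool: {block['name']}", True
        else:
            output, is_error = json.dumps(func(**block["input"])), False
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],  # ties the result to the request
            "content": output,
            "is_error": is_error,
        })
    return results
```

Your code would send these result blocks back in the next user message, and the model would then write its final answer using them.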
Structured outputs force the reply to match a schema you define, so you get parseable JSON every time instead of hoping the model formats it right — essential when the output feeds other software. Prompt caching lets the provider remember a large, unchanging chunk of your prompt (a long system prompt, a big document) across calls, so you don't pay full price to re-process it every time — a major cost lever once your prompts grow.
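Even with structured outputs switched on, it is reasonable to check the reply at the boundary before other code consumes it. This is a minimal hand-rolled sketch for one hypothetical schema (a support ticket with a string `category` and a boolean `urgent`); a schema library would scale better for anything larger.

```python
import json


def parse_ticket(reply_text):
    """Parse a reply expected to look like {"category": str, "urgent": bool}."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Reply must be a JSON object")
    if not isinstance(data.get("category"), str):
        raise ValueError("'category' must be a string")
    if not isinstance(data.get("urgent"), bool):
        raise ValueError("'urgent' must be true or false")
    return data
```

On a `ValueError`, a typical response is to retry the call once or fall back to a safe default rather than pass bad data downstream.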
Production concerns go beyond features. Latency matters: bigger models and longer outputs are slower, so app design often routes easy requests to a small fast model and hard ones to a large model. Reliability means handling rate limits, timeouts, and the occasional bad output with retries, fallbacks, and guardrails. Observability — logging prompts, responses, token counts, and latency — is how you debug and control spend at scale; the discipline of running all this in production is LLMOps. And the open frontier is agentic use: chaining many calls, with tools and memory, into systems that pursue multi-step goals. Every one of those systems still bottoms out in the same humble request you just learned — text in, text out, over HTTP.
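Two of those production ideas, routing and observability, fit in a few lines as a sketch. The model names, the word limit, and the "hard request" keywords are all placeholders invented for illustration; real routers usually rely on better signals than keyword matching. `call_model` is any function you supply that returns `(reply, input_tokens, output_tokens)`.

```python
import time

SMALL_MODEL = "small-fast-model"   # hypothetical model name
LARGE_MODEL = "large-smart-model"  # hypothetical model name


def pick_model(prompt, word_limit=50):
    """Send long or obviously hard prompts to the large model, the rest to the small one."""
    hard_markers = ("prove", "refactor", "step by step")
    is_long = len(prompt.split()) > word_limit
    is_hard = any(marker in prompt.lower() for marker in hard_markers)
    return LARGE_MODEL if (is_long or is_hard) else SMALL_MODEL


def timed_call(call_model, prompt, log):
    """Route the prompt, call the model, and record model, tokens, and latency."""
    model = pick_model(prompt)
    start = time.perf_counter()
    reply, input_tokens, output_tokens = call_model(model, prompt)
    log.append({
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_s": time.perf_counter() - start,
    })
    return reply
```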
FAQ
What is an LLM API in simple terms?
It's a web address you send text to, run by a company like Anthropic or OpenAI, that runs a large language model and sends text back. Your code makes an HTTP request with your message and a secret key, and gets the model's reply as a JSON response. You rent the model by the request instead of running it yourself.
How do I call an AI model from code?
Get an API key from a provider's dashboard, install their SDK (such as the anthropic or openai package), create a client with your key, and call a function with a model name and a list of role-tagged messages. The SDK sends the HTTP request and returns the reply. You can also do it raw with curl to see exactly what travels over the wire.
What is a chat completions API?
It's the chat-style LLM endpoint OpenAI popularized, where you send a list of messages tagged with roles (system, user, assistant) instead of one plain prompt. Its shape became a widely copied standard. Anthropic's equivalent for Claude is the Messages API — same idea with minor naming differences.
Do I need to know machine learning to use an LLM API?
No. Using an LLM API is ordinary web programming — build a request, send it, read the JSON response. You don't train or run the model; the provider does. Knowing how an LLM works helps you prompt it well and avoid pitfalls, but the API itself just needs basic coding and an API key.
Is using an LLM API free?
Usually not. Providers bill by tokens — chunks of text — counting both what you send and what the model generates, so longer prompts and longer answers cost more. Many providers offer a small free trial credit, and open models you run yourself avoid per-call fees but need your own hardware.
Does an LLM API remember previous messages?
No, the API is stateless — the model forgets everything once a call finishes. To keep a conversation going, your code must resend the prior messages on each new request. That growing transcript has to fit inside the model's context window.