AI/TLDR

How to Make Your First LLM API Call

Walk through every step of a real LLM API call — authentication, request structure, response parsing, parameter tuning, multi-turn chat, and error handling — with working Python and JavaScript code.

BEGINNER · 14 MIN READ · UPDATED 2026-06-12

In plain English

Making an LLM API call is a lot like ordering a taxi through an app. You open the app (install the SDK), enter your payment details once (your API key), type your destination (your message), and tap 'request ride' (send the call). A few seconds later a car arrives (the model's reply). You don't know which server ran the model or how the tokens were generated — none of that matters. You just needed to fill in the right form.

This article is a hands-on walkthrough of that form. By the end you will have working code in Python and JavaScript that authenticates, sends a prompt, reads every useful field in the response, adjusts the key parameters, holds a multi-turn conversation, and recovers gracefully from errors. If you want to understand what an LLM API is before you start typing code, read What Is an LLM API? first.

Why it matters

Reading a tutorial is one thing; shipping a working call is another. Most beginners get stuck at one of four predictable spots: setting up the API key without leaking it, understanding which field in the response actually holds the text, knowing which parameters to touch (and which to ignore), and recovering when the network or the provider misbehaves. This tutorial addresses all four in sequence.

Getting past those four hurdles unlocks almost everything else in AI engineering. RAG is retrieval + an API call. Agents are loops of API calls plus tool use. Evals are batches of API calls scored against criteria. The pattern you learn here scales from a single prototype to a high-volume production system — only the error handling and retry logic get more sophisticated.

How each call travels

Before touching code, here is the full path of one API call so you know exactly what every line of your code corresponds to.

Three things happen before the model ever sees your text: the SDK serializes your Python/JS objects into a JSON body, attaches your API key as a request header (Authorization: Bearer for OpenAI, x-api-key for Anthropic), and sends an HTTPS POST to the provider's endpoint. The response comes back as JSON too, and the SDK deserializes it into objects you can dot-access. You interact only with the top and bottom of this chain.
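
To make those three steps concrete, here is a minimal sketch that builds (but does not send) the kind of HTTPS request the OpenAI SDK assembles for you. It uses only the Python standard library; the helper name build_chat_request and the placeholder key are illustrative assumptions, and the real SDK adds more headers and handles sending, retries, and parsing.

```python
import json
import urllib.request


def build_chat_request(api_key: str, user_text: str) -> urllib.request.Request:
    """Assemble the request by hand. Hypothetical helper, for illustration only."""
    # Step 1: serialize plain Python objects into a JSON body.
    body = {
        "model": "gpt-4o-mini",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": user_text}],
    }
    # Steps 2 and 3: attach the key as a header and target the endpoint with POST.
    return urllib.request.Request(
        url="https://api.openai.com/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "sk-placeholder" is not a real key; nothing is sent over the network here.
request = build_chat_request("sk-placeholder", "What is the capital of France?")
print(request.get_method(), request.full_url)
print(json.loads(request.data)["messages"])
```

Passing this object to urllib.request.urlopen would actually send it. In practice the SDK is the better choice, since it also deserializes the reply and raises typed errors.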

Step 1 — Set up your environment

You need two things before writing a single line of app code: an API key and the right package installed.

Get an API key

  1. OpenAI — sign in at platform.openai.com, go to API keys, click Create new secret key.
  2. Anthropic — sign in at console.anthropic.com, go to API Keys, click Create Key.
  3. Copy the key immediately — most dashboards show it only once.

Store the key as an environment variable

terminal (bash)
# macOS / Linux — add to ~/.zshrc or ~/.bashrc for persistence
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Windows (PowerShell)
$env:OPENAI_API_KEY = "sk-..."
$env:ANTHROPIC_API_KEY = "sk-ant-..."
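
A missing environment variable otherwise surfaces later as a bare KeyError or a 401, so it can help to fail early with a readable message. The sketch below is one way to do that; require_api_key and mask_key are hypothetical helper names, not part of either SDK.

```python
import os


def require_api_key(var_name="OPENAI_API_KEY", environ=None):
    """Return the key stored in var_name, or raise a readable error."""
    env = os.environ if environ is None else environ
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it in your shell before running this script."
        )
    return key


def mask_key(key, visible=4):
    """Show only the last few characters so the key can be logged safely."""
    if len(key) <= visible:
        return "..."
    return "..." + key[-visible:]


# Demo with a stand-in environment so the example runs without a real key.
demo_env = {"OPENAI_API_KEY": "sk-demo-not-a-real-key"}
key = require_api_key("OPENAI_API_KEY", demo_env)
print("Loaded key ending in", mask_key(key))
```

In a real script you would call require_api_key() with no second argument so that it reads os.environ, and pass the result to the client constructor.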

Install the SDK

install (bash)
# Python
pip install openai           # OpenAI SDK
pip install anthropic         # Anthropic SDK

# JavaScript / TypeScript (npm)
npm install openai
npm install @anthropic-ai/sdk

Step 2 — Send your first call and read the response

Below is the minimal working call for each provider, with every field labelled. Run this once to confirm your key is wired up correctly.

Python — OpenAI

openai_first_call.py (python)
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # reads env var

response = client.chat.completions.create(
    model="gpt-4o-mini",          # which model to use
    max_tokens=256,               # hard cap on reply length
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "What is the capital of France?"}
    ]
)

# --- Reading the response ---
reply_text  = response.choices[0].message.content   # the model's answer
finish      = response.choices[0].finish_reason     # 'stop' | 'length' | 'content_filter'
prompt_tok  = response.usage.prompt_tokens          # tokens you sent
output_tok  = response.usage.completion_tokens      # tokens the model generated
total_tok   = response.usage.total_tokens           # prompt + completion

print(reply_text)
print(f"Tokens used: {prompt_tok} in / {output_tok} out")

Python — Anthropic (Claude)

anthropic_first_call.py (python)
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

response = client.messages.create(
    model="claude-haiku-4-5",     # check docs for current model IDs
    max_tokens=256,
    system="You are a helpful assistant.",  # system is a top-level field in Anthropic
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

# --- Reading the response ---
reply_text  = response.content[0].text              # Anthropic uses .content list, not .choices
stop_reason = response.stop_reason                  # 'end_turn' | 'max_tokens' | 'stop_sequence'
input_tok   = response.usage.input_tokens
output_tok  = response.usage.output_tokens

print(reply_text)
print(f"Tokens used: {input_tok} in / {output_tok} out")

JavaScript — OpenAI

openai_first_call.mjs (javascript)
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  max_tokens: 256,
  messages: [
    { role: "system",  content: "You are a helpful assistant." },
    { role: "user",    content: "What is the capital of France?" }
  ]
});

const replyText = response.choices[0].message.content;
const { prompt_tokens, completion_tokens } = response.usage;

console.log(replyText);
console.log(`Tokens: ${prompt_tokens} in / ${completion_tokens} out`);

Key differences between providers

Field | OpenAI | Anthropic
System prompt | Message with role: "system" in the array | Separate system top-level field
Reply text path | response.choices[0].message.content | response.content[0].text
Stop reason field | finish_reason on choices[0] | stop_reason on the response root
Token counts | usage.prompt_tokens / usage.completion_tokens | usage.input_tokens / usage.output_tokens
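
If your code needs to work with both providers, one option is to map each response onto a single shape. The sketch below does that using only the field paths from the table above. NormalizedReply and the two normalize_* functions are hypothetical names, and the SimpleNamespace objects are stand-ins for real SDK responses so the example runs offline.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class NormalizedReply:
    text: str
    stop_reason: str
    input_tokens: int
    output_tokens: int
    truncated: bool  # True when the reply hit the max_tokens ceiling


def normalize_openai(response) -> NormalizedReply:
    choice = response.choices[0]
    return NormalizedReply(
        text=choice.message.content,
        stop_reason=choice.finish_reason,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        truncated=choice.finish_reason == "length",
    )


def normalize_anthropic(response) -> NormalizedReply:
    return NormalizedReply(
        text=response.content[0].text,
        stop_reason=response.stop_reason,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        truncated=response.stop_reason == "max_tokens",
    )


# Stand-in objects that mimic the two response shapes (made-up values).
fake_openai = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Paris."),
                             finish_reason="stop")],
    usage=SimpleNamespace(prompt_tokens=24, completion_tokens=3),
)
fake_anthropic = SimpleNamespace(
    content=[SimpleNamespace(text="The capital of")],
    stop_reason="max_tokens",
    usage=SimpleNamespace(input_tokens=20, output_tokens=4),
)

print(normalize_openai(fake_openai))
print(normalize_anthropic(fake_anthropic))
```

The truncated flag is a convenience: it lets calling code react to a cut-off reply without remembering which provider uses which stop-reason string.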

Step 3 — Tune the key parameters

Most parameters default to sensible values. These are the ones worth knowing on day one.

model

The string ID of the model you want. Smaller models (gpt-4o-mini, claude-haiku-*) are faster and cheaper; larger ones (gpt-4o, claude-sonnet-*) are smarter. Start small. Upgrading is a one-line change. Model IDs change as providers release new versions — always copy the current ID from the provider's docs page, not from a tutorial (including this one).

max_tokens

The hard ceiling on how many tokens the model can generate. It does not make the model produce a longer answer; it only stops the reply from running over. If you set it too low, replies get truncated mid-sentence. Output tokens are billed separately from input tokens and are typically more expensive, so this knob also caps the cost of each reply. Anthropic requires it on every call.

temperature

Controls randomness on a 0–2 scale (OpenAI) or 0–1 scale (Anthropic). 0 makes the output as close to deterministic as the API gets: the model almost always picks the most likely token, which suits factual lookups, data extraction, and classification. Identical inputs can still occasionally produce different outputs, so do not depend on exact reproducibility. 1 is the typical creative-writing default. Values above 1 (OpenAI only) produce very diverse, sometimes incoherent output. For most apps, stay between 0 and 1. See temperature explained.
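
To see why lower values sharpen the output, here is the textbook formulation of temperature: divide the model's raw scores (logits) by the temperature before applying softmax. This is a sketch of the general idea, not any provider's actual sampling code, and the three logits are made-up numbers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature rises, probability mass moves from the top candidate to the others, which is why high settings produce more varied and eventually incoherent text.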

stop sequences

A list of strings that halt output as soon as the model generates any one of them (the parameter is stop in OpenAI's API and stop_sequences in Anthropic's). Useful for controlling reply shape: if you ask for a numbered list and add stop=["6."], the model won't generate item six even if max_tokens hasn't been reached. The matched stop string itself is left out of the reply, so you get a clean cut without post-processing.
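
The effect on the reply can be simulated locally. The sketch below is not how the provider implements it (generation really does halt server-side); it only shows what text you would receive for a given stop list. truncate_at_stop is a hypothetical helper and the sample text is made up.

```python
def truncate_at_stop(text, stop_sequences):
    """Return text cut at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# What the model might have written with no stop sequence set.
full_reply = "1. Mercury\n2. Venus\n3. Earth"
print(repr(truncate_at_stop(full_reply, ["2."])))  # → '1. Mercury\n'
```

The stop string is excluded from the result, and a trailing newline may remain, so stripping whitespace before display is usually worthwhile.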

parameter_examples.py (python)
# Reuses the `client` object created in openai_first_call.py
# Factual call: temperature 0
fact_response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=100,
    temperature=0,          # most repeatable output the API offers
    messages=[{"role": "user", "content": "What year was Python created?"}]
)

# Creative call — temperature 0.9
story_response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=400,
    temperature=0.9,        # varied, imaginative output
    messages=[{"role": "user", "content": "Write the opening line of a sci-fi novel."}]
)

# Stop after the first numbered item
list_response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=200,
    stop=["2."],            # halts before item 2
    messages=[{"role": "user", "content": "List the planets as a numbered list, starting with 1."}]
)

Step 4 — Build a multi-turn conversation

The API is stateless: every call starts from scratch and the model has no memory of previous calls. To create a conversation, you maintain a list of messages in your code and extend it after each turn. The code below shows the pattern.

multi_turn.py (python)
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Start with the system instruction
history = [
    {"role": "system", "content": "You are a concise Python tutor."}
]

while True:
    user_input = input("You: ")
    if user_input.lower() in ("quit", "exit"):
        break

    # 1. Append the new user message
    history.append({"role": "user", "content": user_input})

    # 2. Send the ENTIRE history
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=512,
        messages=history
    )

    reply = response.choices[0].message.content
    print(f"Assistant: {reply}\n")

    # 3. Append the model's reply so future turns remember it
    history.append({"role": "assistant", "content": reply})
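
Because the whole history is resent on every turn, a long chat grows in both cost and size until it no longer fits the model's context window. One simple mitigation, sketched below, is to keep the system message plus only the most recent messages. trim_history is a hypothetical helper, and the default of 20 messages is an arbitrary assumption, not a provider recommendation.

```python
def trim_history(history, max_messages=20):
    """Keep every system message plus the last max_messages other messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    if max_messages <= 0:
        return system
    return system + rest[-max_messages:]


# Build a stand-in conversation: one system message plus three full turns.
history = [{"role": "system", "content": "You are a concise Python tutor."}]
for turn in range(1, 4):
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})

trimmed = trim_history(history, max_messages=4)
for message in trimmed:
    print(message["role"], "-", message["content"])
```

Use an even max_messages so the kept window starts on a user message. In the loop above you would pass messages=trim_history(history) instead of messages=history. The model then genuinely forgets older turns, which is the trade-off for bounded cost.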

Step 5 — Handle errors reliably

Network calls fail. The table below covers the errors you'll actually encounter and what to do about each one.

Error / HTTP code | What it means | What to do
401 Unauthorized | API key missing, wrong, or expired | Check env var; regenerate the key in the dashboard
429 Too Many Requests | Rate limit or quota exceeded | Wait and retry with exponential backoff; check your tier's limits
500 / 503 | Provider-side server error | Retry after a short delay; log the request ID in the error for support
400 Bad Request | Malformed JSON, unsupported model ID, or invalid parameter | Fix the request; read the error message for the specific field
Timeout / network drop | Connection dropped mid-call | Retry with idempotent logic; log the attempt count
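
The table can be condensed into a small decision function. The sketch below is one possible encoding; recommended_action and its return labels are hypothetical, and treating every other 4xx code like a 400 is an assumption that goes beyond the rows listed above.

```python
def recommended_action(status_code):
    """Map an HTTP status (or None for a timeout/network drop) to an action."""
    if status_code is None:
        return "retry"            # connection dropped before any status arrived
    if status_code == 401:
        return "fix_credentials"  # retrying with the same key cannot succeed
    if status_code == 429:
        return "retry"            # rate limited: back off, then try again
    if 500 <= status_code < 600:
        return "retry"            # provider-side error: usually transient
    if 400 <= status_code < 500:
        return "fix_request"      # assumption: other 4xx codes need a code change
    return "no_action"            # 2xx/3xx: nothing to handle


for code in (401, 429, 500, 503, 400, None):
    print(code, "->", recommended_action(code))
```

Keeping this decision in one place means the retry loop only has to ask whether the answer is "retry".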

Retry with exponential backoff

The 429 rate-limit error is the one you will hit most often as you scale up. The right response is to wait, then retry — but wait a bit longer each time so you don't hammer the endpoint. Here is a minimal pattern.

retry_with_backoff.py (python)
import os
import time
from openai import OpenAI, RateLimitError, APIStatusError

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def call_with_backoff(messages, model="gpt-4o-mini", max_retries=5):
    delay = 1  # start with 1 second
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                max_tokens=512,
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after max retries
            print(f"Rate limited. Waiting {delay}s before retry {attempt + 1}...")
            time.sleep(delay)
            delay *= 2  # double the wait each time (exponential backoff)
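        except APIStatusError as e:
            # 4xx errors (bad request, auth) won't fix themselves, so don't retry them.
            # On the final attempt, re-raise instead of silently returning None.
            if e.status_code < 500 or attempt == max_retries - 1:
                raise
            time.sleep(delay)  # server error (5xx): safe to retry
            delay *= 2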

# Usage
response = call_with_backoff(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
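
When many clients are rate limited at the same moment, pure doubling makes them all retry in lockstep. A common refinement is to cap the delay and add random jitter. The sketch below only computes the wait schedule, so it runs without a network; backoff_delays, the 30-second cap, and the "full jitter" choice are assumptions rather than anything either provider prescribes.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=30.0, jitter=True, rng=random.random):
    """Return the wait (in seconds) to use before each retry."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))  # 1, 2, 4, ... up to cap
        # Full jitter: pick a random point between 0 and the ceiling.
        delays.append(ceiling * rng() if jitter else ceiling)
    return delays


print("no jitter:  ", backoff_delays(jitter=False))
print("with jitter:", [round(d, 2) for d in backoff_delays()])
```

In call_with_backoff above, the equivalent change would be sleeping for a random value between 0 and delay rather than for delay itself.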

JavaScript error handling

error_handling.mjs (javascript)
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function callWithRetry(messages, maxRetries = 5) {
  let delay = 1000; // ms
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({
        model: "gpt-4o-mini",
        max_tokens: 512,
        messages
      });
    } catch (err) {
      if (err.status === 429 || (err.status >= 500 && err.status < 600)) {
        if (attempt === maxRetries - 1) throw err;
        console.log(`Retrying in ${delay}ms (attempt ${attempt + 1})...`);
        await new Promise(r => setTimeout(r, delay));
        delay *= 2;
      } else {
        throw err; // 4xx: bad request, auth — don't retry
      }
    }
  }
}

const response = await callWithRetry([
  { role: "user", content: "Hello!" }
]);
console.log(response.choices[0].message.content);

Going deeper

Once your first call is working, the natural next steps each add one capability to the same pattern.

Streaming flips one flag (stream=True in Python) so the response arrives token-by-token over a Server-Sent Events connection — that is how chat UIs produce the typewriter effect. The response object is replaced by an iterator you loop over. Full walkthrough: LLM streaming explained.
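
As a sketch of what looping over that iterator looks like, the function below reassembles a reply from OpenAI-style chunks, where each piece of text arrives at chunk.choices[0].delta.content. collect_stream and fake_chunk are hypothetical helpers, and the fake chunks stand in for a real stream so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=None):
    """Join the text pieces from a stream of chat-completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (for example, usage-only chunks)
        piece = chunk.choices[0].delta.content
        if piece:  # the first and last chunks often have no text
            parts.append(piece)
            if on_text is not None:
                on_text(piece)  # e.g. print it immediately for a typewriter effect
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


stream = [fake_chunk(None), fake_chunk("Par"), fake_chunk("is"),
          fake_chunk("."), SimpleNamespace(choices=[])]
full_text = collect_stream(stream, on_text=lambda t: print(t, end="", flush=True))
print()
print(repr(full_text))
```

With a real client, the stream would come from client.chat.completions.create(..., stream=True) and be passed to collect_stream in place of the list.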

Structured outputs add a response_format parameter (OpenAI) or a tool schema (Anthropic) that forces the model to return JSON matching a shape you define — essential when the output feeds downstream code that needs reliable fields. Function calling goes further: you describe tools your code provides, and the model decides to invoke one, sending back a structured call object you execute locally and return as a result. See what is function calling.
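
Even with structured outputs enabled, it is prudent to validate the reply before handing it to downstream code. The sketch below parses a reply string and checks for required keys using only the standard library. parse_json_reply is a hypothetical helper and the sample reply is made up; a schema library would give stricter checks.

```python
import json


def parse_json_reply(reply_text, required_keys):
    """Parse a model reply as a JSON object and confirm the expected keys exist."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"Reply is missing required keys: {missing}")
    return data


# A made-up reply of the kind a model might return for an extraction task.
reply = '{"city": "Paris", "country": "France"}'
record = parse_json_reply(reply, required_keys=["city", "country"])
print(record["city"], "is in", record["country"])
```

Raising ValueError gives the caller a single exception type to catch, for example to re-ask the model when a reply fails validation.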

Prompt caching is a cost lever for production apps. If you have a large system prompt or a long document you include on every call, caching tells the provider to keep the computed state of that text so it doesn't reprocess it each time — you pay a fraction of the normal input-token price for cache hits. Embeddings are a different kind of API call entirely: instead of generating text, you get back a vector representation of your input, used for semantic search and RAG. All of these start from the same foundation you have now: a client, a key, and a request.
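
To show how embedding vectors get used once you have them, the sketch below ranks a few documents against a query by cosine similarity, a common measure for semantic search. The three-dimensional vectors are made-up toy values (real embeddings have hundreds or thousands of dimensions), and cosine_similarity is written by hand here to keep the example dependency-free.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings returned by an embeddings endpoint.
query = [0.9, 0.1, 0.0]
documents = {
    "refund policy": [0.8, 0.2, 0.1],
    "office hours": [0.1, 0.9, 0.2],
    "holiday menu": [0.0, 0.2, 0.9],
}

ranked = sorted(documents, key=lambda name: cosine_similarity(query, documents[name]),
                reverse=True)
print(ranked)
```

In a RAG pipeline the top-ranked documents would then be pasted into the prompt of an ordinary chat call.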

For production deployments, add LLM observability early: log every request, response, token count, and latency from the start. Debugging a live app without that data is very hard. LLM API pricing explains how to model costs before they surprise you.
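
A minimal starting point is a wrapper that times each call and records the token counts. The sketch below assumes an OpenAI-shaped response (usage.prompt_tokens and usage.completion_tokens). logged_call is a hypothetical helper, and fake_send stands in for client.chat.completions.create so the example runs offline.

```python
import logging
import time
from types import SimpleNamespace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_calls")


def logged_call(send, **request):
    """Call send(**request), then log the model, latency, and token usage."""
    start = time.perf_counter()
    response = send(**request)
    latency_ms = (time.perf_counter() - start) * 1000
    usage = getattr(response, "usage", None)
    record = {
        "model": request.get("model"),
        "latency_ms": round(latency_ms, 1),
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
    }
    logger.info("llm_call %s", record)
    return response, record


def fake_send(**request):
    """Stand-in for a real SDK call; returns made-up usage numbers."""
    return SimpleNamespace(usage=SimpleNamespace(prompt_tokens=12, completion_tokens=5))


response, record = logged_call(
    fake_send,
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(record)
```

With a real client you would pass client.chat.completions.create as send. Sending the records to a structured log store rather than the console is the natural next step.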

FAQ

How do I authenticate with an LLM API?

Create an API key in the provider's dashboard (e.g. platform.openai.com or console.anthropic.com), store it as an environment variable (never hardcoded), and pass it when you create the SDK client: OpenAI(api_key=os.environ['OPENAI_API_KEY']). The SDK automatically sends it as an authentication header on every request (Authorization: Bearer for OpenAI, x-api-key for Anthropic).

How do I read the text from an LLM API response?

For OpenAI, the text is at response.choices[0].message.content. For Anthropic (Claude), it is at response.content[0].text. Always check finish_reason (OpenAI) or stop_reason (Anthropic) too — if it says 'length' or 'max_tokens', the reply was cut off and you may need to raise max_tokens.

What does the temperature parameter do in an LLM API call?

Temperature controls output randomness. Set it near 0 for factual, near-deterministic answers (the same input usually, though not always, gives the same output); set it higher (0.7–1.0) for brainstorming, creative writing, or other varied responses. For most tasks, such as summarization, classification, and data extraction, temperature 0 is the right default.

Why does the LLM forget what I said in the previous message?

The API is stateless — each call starts fresh with no memory of earlier calls. To maintain a conversation, your code must collect all previous messages into a list and resend the entire list with every new call. The model reads the full history each time, giving the appearance of memory.

What is a 429 rate limit error and how do I fix it?

A 429 means you've exceeded the provider's request or token rate limit for your account tier. The right fix is to catch the error, wait a few seconds, and retry — doubling the wait time each attempt (exponential backoff). Do not retry immediately in a tight loop, as that will keep triggering the limit. Over time, upgrading your usage tier raises the limits.

How do I make an LLM API call in JavaScript?

Install the openai npm package, create a client with new OpenAI({ apiKey: process.env.OPENAI_API_KEY }), and call await client.chat.completions.create({ model, max_tokens, messages }). The whole pattern is async/await — put your call inside an async function or use top-level await in an ES module (.mjs file or "type": "module" in package.json).

Further reading