How to Build a Chatbot with an LLM API (Step by Step)

Go from an empty folder to a working, deployable chatbot that remembers the conversation.

Beginner · 15 min read · Updated 2026-06-12

In plain English

An LLM API is a web endpoint that accepts a list of messages and replies with the model's next response. Building a chatbot on top of it is mostly about writing the loop that grows that list: every time the user types something, you add it to the list, send the whole list to the API, get a reply, add that to the list, and repeat. The model reads the entire thread fresh each time — your code is the thing that keeps the conversation alive between calls.

Think of it like passing a paper notebook back and forth. The model gets the notebook, reads every page written so far, writes its reply on the next blank page, and hands it back. You then hand it the same notebook — plus the new page the user just wrote — and it reads the whole thing again. There is no hidden memory on the server side; the notebook is your history list.

This tutorial walks you through every step: getting an API key, installing the SDK, writing the history loop in Python, designing a system prompt, turning on streaming so replies appear word-by-word, and wiring everything to a minimal browser UI you can share with others.

Why the LLM API mechanics matter

Most beginners start by prompting ChatGPT in the browser and assume building a chatbot is similar — just type something and a reply appears. The moment you make your first direct API call, a different picture emerges. The API knows nothing about your previous messages unless you send them. Each call is billed separately. The system prompt you write at the top of every request is the main knob you have to shape the bot's personality and limits. And the length of your history list directly determines your cost.

Understanding these mechanics unlocks every more advanced pattern: you can't meaningfully design a RAG pipeline, an agent loop, or a customer support bot without first internalizing that the model is stateless and that you control what it sees. The chatbot tutorial is the fastest path to that understanding.

Concept	What it means in practice
Stateless API	Each request is independent — send the full history every time or the model forgets
System prompt	Your persistent instruction sheet, prepended to every request, invisible to the user
Message roles	`system`, `user`, and `assistant` — the three labels the model uses to parse the thread
Token billing	Every token in the history costs money — history grows with every turn
Streaming	Reply tokens arrive one-by-one, making the UI feel instant instead of frozen
Context window	There is a hard cap on total tokens per request — long chats need trimming or summarizing

How the API and history loop work

Every LLM chat API — OpenAI, Anthropic, Google, and the open-source equivalents — accepts a messages array. Each element is an object with a role and content. The API reads the array top to bottom, determines what has been said, and generates the next assistant turn. Your job is to build and maintain that array correctly across multiple user turns.

One turn of the chatbot loop

1. User types a message: captured by input() or a text field
2. Append to history: role: user, content: message
3. Build API request: system prompt + full history array
4. POST to LLM API: e.g. /v1/chat/completions
5. Receive reply token stream: role: assistant
6. Append reply to history: ready for the next turn

The critical detail is step 3: you send the entire history with every request. If you only sent the latest user message, the model would have no memory of the prior exchange and the conversation would feel broken. This design is what makes the context window relevant — a 128k-token context window means you can send roughly 100,000 words of history before you hit the limit.
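To make the loop concrete without spending tokens, here is a minimal sketch that swaps the real API for a stand-in function. The names `run_turn` and `fake_model` are illustrative and not part of any SDK; `fake_model` simply reports how many messages it received, which makes it easy to see that the whole history travels with every call.

```python
from typing import Callable

Message = dict[str, str]

def fake_model(messages: list[Message]) -> str:
    """Stand-in for the LLM API (hypothetical): reports how many messages it saw."""
    return f"I received {len(messages)} messages."

def run_turn(history: list[Message], user_text: str,
             call_model: Callable[[list[Message]], str],
             system_prompt: str = "You are a concise assistant.") -> str:
    """Run one turn: append the user message, send everything, store the reply."""
    history.append({"role": "user", "content": user_text})
    # The request always contains the system prompt plus the ENTIRE history.
    messages = [{"role": "system", "content": system_prompt}, *history]
    reply = call_model(messages)
    history.append({"role": "assistant", "content": reply})
    return reply

history: list[Message] = []
first = run_turn(history, "Hello", fake_model)
second = run_turn(history, "Still there?", fake_model)
print(first)   # the model saw: system + 1 user message
print(second)  # the model saw: system + 3 history messages
```

In a real chatbot, `call_model` would wrap the provider's SDK call; everything else in the turn stays the same.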

The three message roles

OpenAI's chat completions API (and most others that follow the same interface) uses three core role values for a basic chat. system is a special setup message that the model treats as persistent rules; it is conventionally the first message in the array and is never shown to the user. user messages are what the human typed. assistant messages are the model's previous replies, which you captured and stored in your history list. Other roles exist for features such as tool calls, but a plain chatbot only needs these three.

Messages array for a two-turn conversation

system: "You are a concise assistant. Reply in plain English."
user: "What is gradient descent?"
assistant: "Gradient descent is an optimization algorithm that..."
user: "Can you give a real-world analogy?"

The model reads all four messages and writes the next assistant reply. After it replies, you append that reply to your history list and show it to the user. The list now has five messages. On the next user turn it will have six, and so on.
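In Python, that array is just a list of dictionaries. The sketch below writes out the same two-turn conversation and adds a small sanity check. `check_thread` is a hypothetical helper for illustration, and the rules it enforces (system first, then user and assistant alternating) describe this tutorial's simple chat rather than a requirement of every API. The final appended reply is placeholder text.

```python
messages = [
    {"role": "system", "content": "You are a concise assistant. Reply in plain English."},
    {"role": "user", "content": "What is gradient descent?"},
    {"role": "assistant", "content": "Gradient descent is an optimization algorithm that..."},
    {"role": "user", "content": "Can you give a real-world analogy?"},
]

def check_thread(msgs: list[dict]) -> bool:
    """Return True if the thread is one system message, then user/assistant alternating."""
    if not msgs or msgs[0]["role"] != "system":
        return False
    expected = "user"
    for msg in msgs[1:]:
        if msg["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return True

print(check_thread(messages))

# After the model replies, the reply is appended (placeholder text here).
messages.append({"role": "assistant", "content": "(placeholder for the model's analogy)"})
print(len(messages))
```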

Building the chatbot step by step

Step 1 — Get an API key

You need an account with a model provider. The two most common choices for beginners are OpenAI (GPT-4o) and Anthropic (Claude). Both offer pay-as-you-go billing with no monthly fee. Free trial credit for new accounts has varied over time, so check each provider's current terms before assuming your first calls are free.

OpenAI: sign up at platform.openai.com, open API Keys, and create a new secret key. Use gpt-4o-mini for this tutorial; it is a low-cost model that is capable enough for a chatbot.
Anthropic: sign up at console.anthropic.com, open API Keys, and create a key. Use claude-haiku-4-5; it is fast and inexpensive.
Either way: store the key in a .env file as OPENAI_API_KEY=sk-... (for Anthropic, the SDK reads ANTHROPIC_API_KEY instead) and add .env to .gitignore before your first commit.
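A missing key is a common first-run failure, so it can help to check for it before making any request. The sketch below uses only the standard library. `require_api_key` is a hypothetical helper, and it takes the environment as a parameter so you can try it without touching your real settings.

```python
import os
from typing import Mapping

def require_api_key(env: Mapping[str, str], name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored under `name`, or raise a readable error if it is absent."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return key

# Demonstration with a stand-in environment; the value is a dummy, not a real key.
demo_env = {"OPENAI_API_KEY": "dummy-value-for-illustration"}
print(bool(require_api_key(demo_env)))

# In a real script you would load .env first and then check the real environment:
#   api_key = require_api_key(os.environ)
```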

Step 2 — Install dependencies

bash

python3 -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install openai python-dotenv

Step 3 — Write a system prompt

The system prompt is the most powerful lever you have over your chatbot's behavior. It runs before every conversation and sets the model's persona, tone, scope, and hard limits. A good system prompt for a first project is short and specific: name the role, set the style, and add one or two constraints.

python

SYSTEM_PROMPT = """
You are a friendly assistant for a small online bookshop.
Answer only questions about books, authors, and reading.
Keep every reply under four sentences.
If a question is off-topic, politely redirect.
"""

Notice what the prompt does: it gives the model an identity (bookshop assistant), a scope boundary (books only), a length constraint (four sentences), and a fallback instruction (redirect off-topic questions). Each of these reduces the chance of a confusing or harmful reply without writing a single line of filter code.
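One way to keep those four ingredients visible as the prompt evolves is to assemble it from named parts. This is only a sketch; `build_system_prompt` and its parameter names are made up for illustration.

```python
def build_system_prompt(identity: str, scope: str, length_rule: str, fallback: str) -> str:
    """Join the four parts of a system prompt, one instruction per line."""
    parts = [identity, scope, length_rule, fallback]
    # Skip empty parts so an omitted rule does not leave a blank line.
    return "\n".join(part.strip() for part in parts if part.strip())

SYSTEM_PROMPT = build_system_prompt(
    identity="You are a friendly assistant for a small online bookshop.",
    scope="Answer only questions about books, authors, and reading.",
    length_rule="Keep every reply under four sentences.",
    fallback="If a question is off-topic, politely redirect.",
)
print(SYSTEM_PROMPT)
```

Keeping each rule in its own argument makes it easier to change one constraint and observe its effect on the bot in isolation.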

Step 4 — Build the multi-turn history loop

Here is a complete terminal chatbot in Python. Save it as chatbot.py and run it with python chatbot.py. The history list is the bot's memory — it grows with every exchange and travels inside every API request.

python

from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY from .env
client = OpenAI()

SYSTEM_PROMPT = "You are a concise, friendly assistant. Reply in three sentences or fewer."

history = []   # grows with every turn — THIS is the memory

print("Chatbot ready. Press Ctrl+C to quit.\n")
while True:
    user_input = input("You: ").strip()
    if not user_input:
        continue

    history.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *history,   # send the ENTIRE history every turn
        ],
        max_tokens=300,
    )

    reply = response.choices[0].message.content
    print(f"Bot: {reply}\n")
    history.append({"role": "assistant", "content": reply})

Step 5 — Add streaming so replies appear word by word

By default, the API waits until the full reply is ready before sending it. For a 200-word answer this can feel like a frozen screen for several seconds. Streaming fixes that: the API sends tokens as they are generated and your code prints each one immediately. The user sees the reply growing in real time, just like ChatGPT.

python

# Replace the client.chat.completions.create() call with this:
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
    ],
    max_tokens=300,
    stream=True,   # <-- the only change
)

print("Bot: ", end="", flush=True)
full_reply = ""
for chunk in stream:
    token = chunk.choices[0].delta.content
    if token:
        print(token, end="", flush=True)
        full_reply += token
print()  # newline after the reply

history.append({"role": "assistant", "content": full_reply})

The key change is stream=True. The API now returns an iterator of chunks instead of a single response object. Each chunk has a delta.content field that may contain one or a few tokens. You accumulate them into full_reply so you can append the complete text to history at the end of the turn.
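The accumulation logic can be tried offline by feeding it stand-in chunks. In the sketch below, `fake_stream` and `collect_reply` are hypothetical names, and the `SimpleNamespace` objects only imitate the `choices[0].delta.content` shape described above; they are not real SDK objects. The helper also skips chunks whose content is `None` or whose `choices` list is empty, which is a defensive assumption rather than documented behavior for every provider.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Iterator

def fake_stream(pieces: list) -> Iterator[SimpleNamespace]:
    """Yield objects shaped like streaming chunks (imitation, not the real SDK type)."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    # A final chunk with no choices, to exercise the defensive branch below.
    yield SimpleNamespace(choices=[])

def collect_reply(stream: Iterable, on_token: Callable[[str], None] = print) -> str:
    """Pass each text fragment to on_token and return the concatenated reply."""
    full_reply = ""
    for chunk in stream:
        if not chunk.choices:
            continue                      # nothing to read in this chunk
        token = chunk.choices[0].delta.content
        if token:                         # skip None and empty strings
            on_token(token)
            full_reply += token
    return full_reply

seen: list[str] = []
reply = collect_reply(fake_stream(["Hel", "lo", None, " there"]), on_token=seen.append)
print(reply)
```

Passing the display step in as `on_token` keeps the same accumulation code usable for a terminal, where you would print each fragment, and for a web UI, where you would update a placeholder.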

Adding a simple browser UI

A terminal chatbot works, but it's hard to share and looks unpolished. The fastest path to a browser-based UI is Streamlit, a Python library that turns a script into a web app with zero HTML or CSS. The full chat interface fits in about 30 lines of code.

bash

pip install streamlit
# create app.py with the code below, then:
streamlit run app.py

python

import streamlit as st
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()      # load OPENAI_API_KEY from .env, as in chatbot.py
client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are a concise, friendly assistant."

st.title("My Chatbot")

# Persist the history list across browser re-renders using session_state
if "history" not in st.session_state:
    st.session_state.history = []

# Render all past messages
for msg in st.session_state.history:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

# Chat input at the bottom of the page
if prompt := st.chat_input("Ask me anything..."):
    st.session_state.history.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    # Streaming response
    with st.chat_message("assistant"):
        reply_box = st.empty()
        full_reply = ""
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                *st.session_state.history,
            ],
            max_tokens=400,
            stream=True,
        )
        for chunk in stream:
            token = chunk.choices[0].delta.content or ""
            full_reply += token
            reply_box.write(full_reply)

    st.session_state.history.append({"role": "assistant", "content": full_reply})

A few things to notice about the Streamlit version. st.session_state is Streamlit's way of persisting data between the re-renders that happen each time the user submits a message — without it the history list would reset on every interaction. The st.empty() placeholder is updated in the streaming loop so the reply appears word-by-word in the browser. And st.chat_message renders the familiar chat bubble UI automatically.

Keeping the history from growing too large

After a long conversation the history list can contain hundreds of messages. That inflates every API request, slows the response, and costs more money. The simplest fix is a sliding-window trim: keep only the most recent N turns. A smarter approach is to summarize older turns into a single paragraph and substitute that summary at the top of the history.

python

MAX_TURNS = 10   # keep at most 10 user+assistant pairs

def trim_history(history: list) -> list:
    """Keep only the last MAX_TURNS exchanges (2 messages per turn)."""
    max_messages = MAX_TURNS * 2
    if len(history) > max_messages:
        return history[-max_messages:]
    return history

# Use it before building the API request:
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    *trim_history(st.session_state.history),
]

For most personal and small-team chatbots a sliding window of 10-20 turns is more than enough. The user rarely needs the model to remember something said 50 messages ago, and trimming keeps your costs predictable.
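Counting messages ignores how long each one is. A variation is to trim by an approximate token budget instead. The sketch below estimates tokens as roughly one per four characters, which is a crude rule of thumb assumed here for illustration and not a real tokenizer. `estimate_tokens` and `trim_to_budget` are hypothetical helpers.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4)

def trim_to_budget(history: list[dict], max_tokens: int) -> list[dict]:
    """Keep the newest messages whose estimated total fits within max_tokens."""
    kept: list[dict] = []
    total = 0
    for msg in reversed(history):              # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    kept.reverse()                             # restore chronological order
    # Drop a leading assistant message so the window starts with a user turn.
    if kept and kept[0]["role"] == "assistant":
        kept = kept[1:]
    return kept

history = [
    {"role": "user", "content": "a" * 40},        # about 10 tokens each
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
    {"role": "assistant", "content": "d" * 40},
    {"role": "user", "content": "e" * 40},
]
trimmed = trim_to_budget(history, max_tokens=35)
print([m["content"][0] for m in trimmed])
```

Starting the window on a user turn is a design choice: some APIs expect the first non-system message to come from the user, so it avoids a request that opens with an orphaned assistant reply.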

Common pitfalls and how to avoid them

Most beginners hit the same set of problems on their first chatbot. Knowing them in advance saves hours of frustration.

Forgetting to append the assistant reply to history

If you print the reply but do not append it to history, the model will have no memory of what it said. On the next turn it will contradict itself or treat every message as the first one. Always do history.append({"role": "assistant", "content": reply}) after every API call.

Sending the system prompt inside the history list

The system message belongs at the top of the messages array on every request. A common mistake is appending it to the history list once during setup — then it appears in the middle of the thread after a few turns, confusing the model. Keep the system message separate from history and always prepend it when building the request.
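One way to make this mistake hard to commit is to route every request through a single builder function. A minimal sketch, with `build_messages` as a hypothetical helper name: it always puts the system message first and ignores any system entries that slipped into the history list.

```python
def build_messages(system_prompt: str, history: list[dict]) -> list[dict]:
    """Return the request payload: system message first, then user/assistant turns."""
    turns = [m for m in history if m["role"] != "system"]
    return [{"role": "system", "content": system_prompt}, *turns]

# A history list that was polluted with a system message by mistake.
history = [
    {"role": "user", "content": "Hi"},
    {"role": "system", "content": "stray system message"},
    {"role": "assistant", "content": "Hello!"},
]
payload = build_messages("You are a concise, friendly assistant.", history)
print([m["role"] for m in payload])
```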

Running out of context on long conversations

If your total message tokens exceed the model's context window, the API returns an error. gpt-4o-mini has a 128k-token context window, which is large enough that most chatbots will never hit it in practice. Still, implementing the sliding-window trim from the previous section is good hygiene even before you need it.

Not handling API errors

The API can return rate-limit errors (429), server errors (500), or authentication errors (401). Wrap your client.chat.completions.create() call in a try/except block and show the user a friendly message instead of a raw Python traceback. For production use, add exponential backoff on 429 errors.

Error code	Likely cause	Fix
`401`	Invalid or missing API key	Check `OPENAI_API_KEY` in your `.env` file
`429`	Rate limit or quota exceeded	Wait and retry with backoff; if the quota is exhausted, check billing and usage limits in the dashboard
`500`	Provider-side outage	Retry with backoff; surface error to user
`400` (`context_length_exceeded`)	History too long for the model	Trim history before the next request
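The retry advice above can be sketched as a small wrapper. This is a generic illustration under stated assumptions: `call_with_backoff`, `TransientError`, and `flaky_call` are hypothetical names, and the delays (1 s, 2 s, 4 s) are arbitrary. With the OpenAI Python SDK you would typically pass its rate-limit exception class (such as `openai.RateLimitError`) as `retry_on` instead of the stand-in used here.

```python
import time
from typing import Callable, Tuple, Type

class TransientError(Exception):
    """Stand-in for a retryable failure such as an HTTP 429 (hypothetical)."""

def call_with_backoff(fn: Callable[[], str],
                      retry_on: Tuple[Type[BaseException], ...] = (TransientError,),
                      max_attempts: int = 4,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fn, retrying on retry_on errors with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: let the caller decide
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")

# Demonstration: fail twice, then succeed. Delays are recorded instead of slept.
attempts = {"count": 0}

def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated rate limit")
    return "ok"

delays: list[float] = []
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
```

Errors that retrying cannot fix, such as an invalid key, are deliberately not in `retry_on`; they propagate immediately so you can show the user a clear message.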

Going deeper

A working streaming chatbot with a web UI is a solid foundation. Here is what to build next, in roughly the order that becomes useful.

Persisting conversations to a database

Right now the history lives in st.session_state — it disappears when the browser tab closes. To let users continue past conversations, save each message to a database (SQLite works fine for personal projects; Postgres or Supabase for anything multi-user) and reload the relevant thread on startup. Each conversation gets a UUID, messages have a foreign key to it, and your history loader pulls the last N messages for that thread.
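Here is a minimal sketch of that layout using Python's built-in `sqlite3` module. The table and function names (`conversations`, `messages`, `save_message`, `load_history`) are assumptions made for illustration, and the example uses an in-memory database so it leaves no file behind; pass a file path to `open_db` to persist across restarts.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE IF NOT EXISTS conversations (id TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    conversation_id TEXT NOT NULL REFERENCES conversations(id),
    role TEXT NOT NULL,
    content TEXT NOT NULL
);
"""

def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def new_conversation(conn: sqlite3.Connection) -> str:
    """Create a conversation row and return its UUID."""
    conversation_id = str(uuid.uuid4())
    conn.execute("INSERT INTO conversations (id) VALUES (?)", (conversation_id,))
    conn.commit()
    return conversation_id

def save_message(conn: sqlite3.Connection, conversation_id: str,
                 role: str, content: str) -> None:
    """Append one message to a conversation."""
    conn.execute(
        "INSERT INTO messages (conversation_id, role, content) VALUES (?, ?, ?)",
        (conversation_id, role, content),
    )
    conn.commit()

def load_history(conn: sqlite3.Connection, conversation_id: str,
                 last_n: int = 20) -> list[dict]:
    """Return the last_n messages of a conversation in chronological order."""
    rows = conn.execute(
        "SELECT role, content FROM messages "
        "WHERE conversation_id = ? ORDER BY id DESC LIMIT ?",
        (conversation_id, last_n),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in reversed(rows)]

conn = open_db()
cid = new_conversation(conn)
save_message(conn, cid, "user", "Hi")
save_message(conn, cid, "assistant", "Hello!")
save_message(conn, cid, "user", "Recommend a book.")
print(load_history(conn, cid, last_n=2))
```

Because `load_history` returns the same role and content dictionaries the API expects, its result can be used directly as the starting history list.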

Switching to Anthropic Claude

The concepts are identical across providers. Anthropic's SDK separates the system prompt from the messages array at the top level rather than as a role value inside the array, but the loop logic is the same. Swap from openai import OpenAI for import anthropic, change the client initialization and model name, and move your system prompt to the system parameter of client.messages.create().

python

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=300,
    system=SYSTEM_PROMPT,   # separate from messages
    messages=history,       # same list format: role + content
)
reply = response.content[0].text

Adding tool calls so the bot can take actions

Once the basic chat loop is stable, you can give the model the ability to call functions you define, for example to look up a database, run a web search, or read a file. You describe the available functions in the API request, the model decides when to invoke them, and your code executes the call and feeds the result back as a new message. This pattern, called function calling or tool use, is the foundation of most AI agents. It is the natural next step once a chatbot works reliably.
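The sketch below shows only the part your code owns: looking up the requested function, running it, and packaging the result as a message. The tool-call dictionary imitates the shape used by OpenAI-style chat APIs (a function name plus JSON-encoded arguments), but it is hand-written here rather than returned by a model. `lookup_stock`, `TOOLS`, and `run_tool_call` are hypothetical names, and the inventory is made-up sample data.

```python
import json

def lookup_stock(title: str) -> dict:
    """Hypothetical tool: pretend to look up a book in an inventory."""
    inventory = {"Dune": 3, "Emma": 0}       # made-up sample data
    return {"title": title, "in_stock": inventory.get(title, 0)}

TOOLS = {"lookup_stock": lookup_stock}       # tool name -> Python function

def run_tool_call(tool_call: dict) -> dict:
    """Execute one requested tool call and build the message that reports its result."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name not in TOOLS:
        result = {"error": f"unknown tool: {name}"}
    else:
        result = TOOLS[name](**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }

# Hand-written stand-in for what a model might request.
requested = {
    "id": "call_demo_1",
    "function": {"name": "lookup_stock", "arguments": json.dumps({"title": "Dune"})},
}
tool_message = run_tool_call(requested)
print(tool_message["content"])
```

The returned message is appended to the history like any other turn, and the next API call lets the model phrase an answer from the tool's result.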

Grounding the bot with your own documents (RAG)

A model trained on public data knows nothing about your company, codebase, or documents, and may make things up when asked about them. The fix is retrieval-augmented generation: store your documents as vector embeddings, retrieve the relevant chunks at query time, and inject them into the system prompt. This is often one of the highest-value upgrades you can make after a chatbot works end to end.
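To show the retrieve-then-inject flow without an embedding model, the sketch below scores documents by simple word overlap. That scoring is a stand-in chosen for illustration and is not vector embeddings; the sample documents, `retrieve`, and `build_grounded_prompt` are all hypothetical.

```python
import re

DOCS = [
    "Our shop ships orders within two business days.",
    "Returns are accepted within 30 days with a receipt.",
    "The store is closed on public holidays.",
]  # made-up sample documents

def words(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    query_words = words(query)
    ranked = sorted(docs, key=lambda d: len(words(d) & query_words), reverse=True)
    return ranked[:k]

def build_grounded_prompt(base_prompt: str, chunks: list[str]) -> str:
    """Append the retrieved chunks to the system prompt as bullet points."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"{base_prompt}\n\nUse only this context when relevant:\n{context}"

question = "Are returns accepted without a receipt?"
top = retrieve(question, DOCS, k=1)
print(build_grounded_prompt("You are a helpful shop assistant.", top))
```

A real pipeline would replace `retrieve` with an embedding-based similarity search, while the injection step stays essentially the same.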

Natural upgrade path from your first chatbot

1. Terminal chatbot: stateless loop, ~30 lines
2. Streaming + history trim: feels fast, cost controlled
3. Browser UI (Streamlit): shareable, deployed in minutes
4. Persistent DB storage: conversations survive tab close
5. Tool calls or RAG: bot can act and knows your data

FAQ

How much will it cost to run my chatbot with the OpenAI API?

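For light personal use, typically a few cents per day. gpt-4o-mini costs $0.15 per million input tokens, so 2,000 input tokens (roughly a 10-turn conversation's worth of text) cost around $0.0003. Keep in mind that output tokens are billed at a higher rate and that the full history is resent on every turn, so the billed total is somewhat higher than the length of the transcript alone suggests. Set a monthly spending cap in the OpenAI dashboard to avoid surprise bills.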
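Because the full history is resent on every turn, input tokens accumulate faster than the conversation itself grows. The sketch below makes that arithmetic explicit. The input price comes from the answer above; the tokens-per-message figure and the output price are assumptions you should replace with numbers from the current pricing page, and the system prompt is ignored for simplicity. `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(turns: int, tokens_per_message: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Estimate the dollar cost of a chat in which each turn resends all history."""
    input_tokens = 0
    output_tokens = 0
    history_tokens = 0
    for _ in range(turns):
        history_tokens += tokens_per_message   # new user message joins the history
        input_tokens += history_tokens         # the whole history is sent as input
        output_tokens += tokens_per_message    # the model writes one reply
        history_tokens += tokens_per_message   # the reply joins the history
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000

cost = estimate_cost(
    turns=10,
    tokens_per_message=100,          # assumption: short messages
    input_price_per_million=0.15,    # from the answer above
    output_price_per_million=0.60,   # assumption: check current pricing
)
print(f"${cost:.6f}")
```

With these assumptions the ten turns send 10,000 input tokens in total, even though the transcript itself is only 2,000 tokens long, which is why trimming the history keeps costs predictable.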

Why does my chatbot forget the conversation every time I restart it?

The history list lives in memory and is lost when the script stops. To persist conversations across restarts, save each message to a file or a local SQLite database and reload it on startup.

What is the difference between the system prompt and the user message?

The system prompt is a persistent, behind-the-scenes instruction prepended to every request — it sets the bot's persona and rules, and the user never sees it. User messages are what the human types. See the guide on system prompts for the full breakdown.

Can I use this same approach with Anthropic Claude or Google Gemini?

Yes. The major providers expose chat APIs built on the same idea: an ordered list of role-tagged messages. Anthropic's SDK puts the system prompt in a separate system parameter instead of a role inside the array, and Google Gemini's native API uses its own field and role names. In each case the history loop is identical: you swap the client, the model name, and the request shape, and the rest of your code stays the same.

Why does streaming require a different code pattern than a normal API call?

Without streaming, the API returns a single response object when the full reply is ready. With stream=True, it returns an iterator of chunks — each chunk has a delta.content field that may hold one or a few tokens. You loop over the chunks, accumulate them into a string, and print or display each one as it arrives. At the end of the loop you have the full reply to append to history.

What happens if the conversation history gets too long?

The API will return a context_length_exceeded error if the total tokens exceed the model's context window. The simplest fix is to keep only the last 10-20 message pairs in the array you send. For production chatbots, consider summarizing older turns rather than discarding them, so the model retains important context from earlier in the thread.
