In plain English
An LLM API is a web endpoint that accepts a list of messages and replies with the model's next response. Building a chatbot on top of it is mostly about writing the loop that grows that list: every time the user types something, you add it to the list, send the whole list to the API, get a reply, add that to the list, and repeat. The model reads the entire thread fresh each time — your code is the thing that keeps the conversation alive between calls.
Think of it like passing a paper notebook back and forth. The model gets the notebook, reads every page written so far, writes its reply on the next blank page, and hands it back. You then hand it the same notebook — plus the new page the user just wrote — and it reads the whole thing again. There is no hidden memory on the server side; the notebook is your history list.
This tutorial walks you through every step: getting an API key, installing the SDK, writing the history loop in Python, designing a system prompt, turning on streaming so replies appear word-by-word, and wiring everything to a minimal browser UI you can share with others.
Why the LLM API mechanics matter
Most beginners start by prompting ChatGPT in the browser and assume building a chatbot is similar — just type something and a reply appears. The moment you make your first direct API call, a different picture emerges. The API knows nothing about your previous messages unless you send them. Each call is billed separately. The system prompt you write at the top of every request is the main knob you have to shape the bot's personality and limits. And the length of your history list directly determines your cost.
Understanding these mechanics unlocks every more advanced pattern: you can't meaningfully design a RAG pipeline, an agent loop, or a customer support bot without first internalizing that the model is stateless and that you control what it sees. The chatbot tutorial is the fastest path to that understanding.
| Concept | What it means in practice |
|---|---|
| Stateless API | Each request is independent — send the full history every time or the model forgets |
| System prompt | Your persistent instruction sheet, prepended to every request, invisible to the user |
| Message roles | system, user, and assistant — the three labels the model uses to parse the thread |
| Token billing | Every token in the history costs money — history grows with every turn |
| Streaming | Reply tokens arrive one-by-one, making the UI feel instant instead of frozen |
| Context window | There is a hard cap on total tokens per request — long chats need trimming or summarizing |
How the API and history loop work
Nearly every LLM chat API (OpenAI, Anthropic, Google, and the open-source equivalents) accepts a list of role-tagged messages; in OpenAI's API it is the messages array. Each element is an object with a role and content. The API reads the array top to bottom, determines what has been said, and generates the next assistant turn. Your job is to build and maintain that array correctly across multiple user turns.
The critical detail is that you send the entire history with every request. If you sent only the latest user message, the model would have no memory of the prior exchange and the conversation would feel broken. This design is what makes the context window relevant: a 128k-token context window means you can send roughly 100,000 words of history before you hit the limit.
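To see why the full history matters, here is a minimal offline sketch. `fake_model` is a hypothetical stand-in for a real API call (it is not part of any SDK): it can only "remember" a name if that name appears somewhere in the messages it is handed.

```python
def fake_model(messages):
    """Hypothetical stand-in for a chat API: it sees ONLY the messages passed in."""
    name = None
    for msg in messages:
        if msg["role"] == "user" and msg["content"].startswith("My name is "):
            name = msg["content"][len("My name is "):].rstrip(".")
    last = messages[-1]["content"]
    if last == "What is my name?":
        return f"Your name is {name}." if name else "I don't know your name."
    return "Nice to meet you!"


def chat_turn(history, user_text):
    """One turn of the loop: append user text, send everything, append the reply."""
    history.append({"role": "user", "content": user_text})
    reply = fake_model(history)  # the whole list travels with every call
    history.append({"role": "assistant", "content": reply})
    return reply


history = []
chat_turn(history, "My name is Ada.")
with_history = chat_turn(history, "What is my name?")

# Sending only the latest message loses the earlier turn.
without_history = fake_model([{"role": "user", "content": "What is my name?"}])

print(with_history)     # Your name is Ada.
print(without_history)  # I don't know your name.
```

The stub has no state of its own, so the only difference between the two answers is what was in the list.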
The three message roles
OpenAI's chat completions API (and most others that follow the same interface) uses three role values. system is a special setup message that the model treats as persistent rules — it's always the first message in the array and it's never shown to the user. user messages are what the human typed. assistant messages are the model's previous replies, which you captured and stored in your history list.
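As an illustration, a request partway through a conversation might carry an array like the one below. The message contents are invented for this example; only the role and content shape comes from the API.

```python
# An illustrative messages array: one system message, then alternating turns.
messages = [
    {"role": "system", "content": "You are a concise, friendly assistant."},
    {"role": "user", "content": "Can you recommend a classic novel?"},
    {"role": "assistant", "content": "Try 'Pride and Prejudice' by Jane Austen."},
    {"role": "user", "content": "Who else wrote in that period?"},
]

roles = [m["role"] for m in messages]
print(roles)  # ['system', 'user', 'assistant', 'user']
```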
The model reads all four messages and writes the next assistant reply. After it replies, you append that reply to your history list and show it to the user. The list now has five messages. On the next user turn it will have six, and so on.
Building the chatbot step by step
Step 1 — Get an API key
You need an account with a model provider. The two most common choices for beginners are OpenAI (GPT-4o) and Anthropic (Claude). Both offer pay-as-you-go billing with no monthly fee. Free trial credits for new accounts have varied over time, so check each provider's current pricing page before counting on them.
- OpenAI: sign up at platform.openai.com, open API Keys, and create a new secret key. Use gpt-4o-mini for this tutorial; it is the cheapest capable model.
- Anthropic: sign up at console.anthropic.com, open API Keys, and create a key. Use claude-haiku-4-5; it is fast and inexpensive.
- Either way: store the key in a .env file as OPENAI_API_KEY=sk-... (or ANTHROPIC_API_KEY=... if you chose Anthropic) and add .env to .gitignore before your first commit.
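A small guard at startup can catch a missing key before the first request fails. This is a sketch using only the standard library; `require_api_key` is a hypothetical helper name, and it assumes the key has already been loaded into the environment (for example by python-dotenv, installed in the next step).

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return key


# Example with an explicit mapping, so it runs without a real key:
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))  # sk-example
```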
Step 2 — Install dependencies
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install openai python-dotenv
Step 3 — Write a system prompt
The system prompt is the most powerful lever you have over your chatbot's behavior. It runs before every conversation and sets the model's persona, tone, scope, and hard limits. A good system prompt for a first project is short and specific: name the role, set the style, and add one or two constraints.
SYSTEM_PROMPT = """
You are a friendly assistant for a small online bookshop.
Answer only questions about books, authors, and reading.
Keep every reply under four sentences.
If a question is off-topic, politely redirect.
"""
Notice what the prompt does: it gives the model an identity (bookshop assistant), a scope boundary (books only), a length constraint (four sentences), and a fallback instruction (redirect off-topic questions). Each of these reduces the chance of a confusing or harmful reply without writing a single line of filter code.
Step 4 — Build the multi-turn history loop
Here is a complete terminal chatbot in Python. Save it as chatbot.py and run it with python chatbot.py. The history list is the bot's memory — it grows with every exchange and travels inside every API request.
from openai import OpenAI
from dotenv import load_dotenv
load_dotenv()  # reads OPENAI_API_KEY from .env
client = OpenAI()
SYSTEM_PROMPT = "You are a concise, friendly assistant. Reply in three sentences or fewer."
history = []  # grows with every turn: THIS is the memory
print("Chatbot ready. Press Ctrl+C to quit.\n")
while True:
    try:
        user_input = input("You: ").strip()
    except (KeyboardInterrupt, EOFError):
        # Ctrl+C (or Ctrl+D) exits cleanly instead of printing a traceback
        print("\nGoodbye!")
        break
    if not user_input:
        continue
    history.append({"role": "user", "content": user_input})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            *history,  # send the ENTIRE history every turn
        ],
        max_tokens=300,
    )
    reply = response.choices[0].message.content
    print(f"Bot: {reply}\n")
    history.append({"role": "assistant", "content": reply})
Step 5 — Add streaming so replies appear word by word
By default, the API waits until the full reply is ready before sending it. For a 200-word answer this can feel like a frozen screen for several seconds. Streaming fixes that: the API sends tokens as they are generated and your code prints each one immediately. The user sees the reply growing in real time, just like ChatGPT.
# Replace the client.chat.completions.create() call with this:
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
    ],
    max_tokens=300,
    stream=True,  # <-- the only change to the request
)
# ...and replace the lines that read and print the reply with this:
print("Bot: ", end="", flush=True)
full_reply = ""
for chunk in stream:
    if not chunk.choices:  # defensive: skip chunks that carry no choices
        continue
    token = chunk.choices[0].delta.content
    if token:  # content can be None, so skip empty pieces
        print(token, end="", flush=True)
        full_reply += token
print()  # newline after the reply
history.append({"role": "assistant", "content": full_reply})
The key change is stream=True. The API now returns an iterator of chunks instead of a single response object. Each chunk has a delta.content field that may contain one or a few tokens. You accumulate them into full_reply so you can append the complete text to history at the end of the turn.
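The accumulation logic can be tried without a network call. The sketch below builds fake chunks with `types.SimpleNamespace` that mimic only the attribute path the loop reads (`choices[0].delta.content`); `make_chunk` and `collect_stream` are hypothetical helper names, not SDK functions.

```python
from types import SimpleNamespace


def make_chunk(text):
    """Build a fake chunk with the same attribute path the loop reads."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(stream):
    """Accumulate streamed pieces into the full reply, skipping empty chunks."""
    full_reply = ""
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        token = chunk.choices[0].delta.content
        if token:  # content can be None or empty
            full_reply += token
    return full_reply


fake_stream = [make_chunk("Hel"), make_chunk("lo"), make_chunk(None), make_chunk("!")]
print(collect_stream(fake_stream))  # Hello!
```

Keeping the accumulation in its own function also makes it easy to reuse between the terminal loop and a web UI.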
Adding a simple browser UI
A terminal chatbot works, but it's hard to share and looks unpolished. The fastest path to a browser-based UI is Streamlit, a Python library that turns a script into a web app with zero HTML or CSS. The full chat interface fits in a few dozen lines.
pip install streamlit
# create app.py with the code below, then:
streamlit run app.py

# app.py
import streamlit as st
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()  # load OPENAI_API_KEY from .env, as in chatbot.py
client = OpenAI()  # reads OPENAI_API_KEY from the environment
SYSTEM_PROMPT = "You are a concise, friendly assistant."
st.title("My Chatbot")
# Persist the history list across browser re-renders using session_state
if "history" not in st.session_state:
    st.session_state.history = []
# Render all past messages
for msg in st.session_state.history:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])
# Chat input at the bottom of the page
if prompt := st.chat_input("Ask me anything..."):
    st.session_state.history.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    # Streaming response
    with st.chat_message("assistant"):
        reply_box = st.empty()
        full_reply = ""
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                *st.session_state.history,
            ],
            max_tokens=400,
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:  # defensive: skip chunks with no choices
                continue
            token = chunk.choices[0].delta.content or ""
            full_reply += token
            reply_box.write(full_reply)
    st.session_state.history.append({"role": "assistant", "content": full_reply})
A few things to notice about the Streamlit version. st.session_state is Streamlit's way of persisting data between the re-renders that happen each time the user submits a message; without it the history list would reset on every interaction. The st.empty() placeholder is updated in the streaming loop so the reply appears word-by-word in the browser. And st.chat_message renders the familiar chat bubble UI automatically.
Keeping the history from growing too large
After a long conversation the history list can contain hundreds of messages. That inflates every API request, slows the response, and costs more money. The simplest fix is a sliding-window trim: keep only the most recent N turns. A smarter approach is to summarize older turns into a single paragraph and substitute that summary at the top of the history.
MAX_TURNS = 10  # keep at most 10 user+assistant pairs
def trim_history(history: list) -> list:
    """Keep only the last MAX_TURNS exchanges (2 messages per turn)."""
    max_messages = MAX_TURNS * 2
    if len(history) > max_messages:
        return history[-max_messages:]
    return history
# Use it before building the API request:
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    *trim_history(st.session_state.history),
]
For most personal and small-team chatbots a sliding window of 10-20 turns is more than enough. The user rarely needs the model to remember something said 50 messages ago, and trimming keeps your costs predictable.
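The summarizing approach mentioned above could be sketched as follows. This is an illustration under assumptions: `compact_history` and `naive_summarize` are hypothetical names, the placeholder summarizer just truncates text, and a real bot would more likely ask the model itself to write the summary. Placing the summary in a user-role message is one design choice among several; folding it into the system prompt is another.

```python
def compact_history(history, summarize, keep_last=4):
    """Replace older messages with one summary message; keep the newest as-is.

    `summarize` is any callable that maps a list of messages to a short string.
    """
    if len(history) <= keep_last:
        return list(history)
    older, recent = history[:-keep_last], history[-keep_last:]
    summary = {
        "role": "user",  # design choice; the summary could also go in the system prompt
        "content": "Summary of the earlier conversation: " + summarize(older),
    }
    return [summary, *recent]


def naive_summarize(messages):
    """Placeholder summarizer: the first 40 characters of each message."""
    return " | ".join(m["content"][:40] for m in messages)


history = [
    {"role": "user", "content": f"message {i}"} if i % 2 == 0
    else {"role": "assistant", "content": f"reply {i}"}
    for i in range(10)
]
compacted = compact_history(history, naive_summarize, keep_last=4)
print(len(compacted))  # 5
```

The ten original messages become one summary plus the four most recent messages, so the request stays small while some earlier context survives.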
Common pitfalls and how to avoid them
Most beginners hit the same set of problems on their first chatbot. Knowing them in advance saves hours of frustration.
Forgetting to append the assistant reply to history
If you print the reply but do not append it to history, the model will have no memory of what it said. On the next turn it will contradict itself or treat every message as the first one. Always do history.append({"role": "assistant", "content": reply}) after every API call.
Sending the system prompt inside the history list
The system message belongs at the top of the messages array on every request. A common mistake is appending it to the history list once during setup — then it appears in the middle of the thread after a few turns, confusing the model. Keep the system message separate from history and always prepend it when building the request.
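One way to make this mistake hard to commit is to build the request in a single helper. The sketch below is an assumption-laden illustration: `build_messages` is a hypothetical function name, and the defensive filter simply drops any system message that slipped into history.

```python
def build_messages(system_prompt, history):
    """Return the request payload: system message first, then the chat turns."""
    # Defensive: drop any system messages that ended up in history by mistake.
    turns = [m for m in history if m["role"] != "system"]
    return [{"role": "system", "content": system_prompt}, *turns]


history = [
    {"role": "user", "content": "Hi"},
    {"role": "system", "content": "stray system message"},
    {"role": "assistant", "content": "Hello!"},
]
messages = build_messages("You are a concise, friendly assistant.", history)
print([m["role"] for m in messages])  # ['system', 'user', 'assistant']
```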
Running out of context on long conversations
If your total message tokens exceed the model's context window, the API returns an error. gpt-4o-mini has a 128k-token context window, which is large enough that most chatbots will never hit it in practice. Still, implementing the sliding-window trim from the previous section is good hygiene even before you need it.
Not handling API errors
The API can return rate-limit errors (429), server errors (500), or authentication errors (401). Wrap your client.chat.completions.create() call in a try/except block and show the user a friendly message instead of a raw Python traceback. For production use, add exponential backoff on 429 errors.
| Error code | Likely cause | Fix |
|---|---|---|
| 401 | Invalid or missing API key | Check OPENAI_API_KEY in your .env file |
| 429 | Rate limit or quota exceeded | Wait and retry; add a spending cap in the dashboard |
| 500 | Provider-side outage | Retry with backoff; surface error to user |
| context_length_exceeded | History too long for the model | Trim history before the next request |
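The retry-with-backoff advice can be sketched generically. `call_with_backoff` is a hypothetical helper, and `FakeRateLimitError` stands in for the provider's 429 exception so the example runs offline; with the OpenAI Python SDK you would most likely test for `openai.RateLimitError` in `is_retryable` instead.

```python
import random
import time


def call_with_backoff(call, is_retryable, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call()`; on a retryable exception wait about 1s, 2s, 4s, ... and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Exponential delay plus a little jitter to avoid synchronized retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))


class FakeRateLimitError(Exception):
    """Stand-in for a provider's 429 error, used only in this demo."""


attempts = {"count": 0}


def flaky_call():
    """Fail twice with a fake 429, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("429")
    return "ok"


result = call_with_backoff(
    flaky_call,
    is_retryable=lambda exc: isinstance(exc, FakeRateLimitError),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result, attempts["count"])  # ok 3
```

Authentication errors (401) should not be retried, which is why the helper takes an explicit `is_retryable` check rather than retrying everything.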
Going deeper
A working streaming chatbot with a web UI is a solid foundation. Here is what to build next, in roughly the order that becomes useful.
Persisting conversations to a database
Right now the history lives in st.session_state — it disappears when the browser tab closes. To let users continue past conversations, save each message to a database (SQLite works fine for personal projects; Postgres or Supabase for anything multi-user) and reload the relevant thread on startup. Each conversation gets a UUID, messages have a foreign key to it, and your history loader pulls the last N messages for that thread.
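The schema described above might look like this with the standard-library `sqlite3` module. Table, column, and function names here are illustrative assumptions, not a prescribed layout.

```python
import sqlite3
import uuid


def init_db(conn):
    """Create the two tables: one row per conversation, one row per message."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS conversations (id TEXT PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS messages (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            conversation_id TEXT NOT NULL REFERENCES conversations(id),
            role TEXT NOT NULL,
            content TEXT NOT NULL
        );
        """
    )


def new_conversation(conn):
    """Insert a conversation with a fresh UUID and return that id."""
    conversation_id = str(uuid.uuid4())
    conn.execute("INSERT INTO conversations (id) VALUES (?)", (conversation_id,))
    conn.commit()
    return conversation_id


def save_message(conn, conversation_id, role, content):
    conn.execute(
        "INSERT INTO messages (conversation_id, role, content) VALUES (?, ?, ?)",
        (conversation_id, role, content),
    )
    conn.commit()


def load_history(conn, conversation_id, last_n=20):
    """Return the last `last_n` messages of a thread, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE conversation_id = ? "
        "ORDER BY id DESC LIMIT ?",
        (conversation_id, last_n),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in reversed(rows)]


conn = sqlite3.connect(":memory:")  # use a file path such as "chat.db" to persist
init_db(conn)
thread = new_conversation(conn)
save_message(conn, thread, "user", "Hi")
save_message(conn, thread, "assistant", "Hello!")
print(load_history(conn, thread))
```

`load_history` returns the same role and content dictionaries the chat loop already uses, so its result can be assigned straight to the history list.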
Switching to Anthropic Claude
The concepts are identical across providers. Anthropic's SDK separates the system prompt from the messages array at the top level rather than as a role value inside the array, but the loop logic is the same. Swap from openai import OpenAI for import anthropic, change the client initialization and model name, and move your system prompt to the system parameter of client.messages.create().
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=300,
    system=SYSTEM_PROMPT,  # separate from messages
    messages=history,  # same list format: role + content
)
reply = response.content[0].text
Adding tool calls so the bot can take actions
Once the basic chat loop is stable, you can give the model the ability to call functions you define — look up a database, run a web search, or read a file. You describe the available functions in the API request, the model decides when to invoke them, your code executes the call and feeds the result back as a new message. This pattern — called function calling or tool use — is the foundation of every AI agent. It is the natural next step once a chatbot works reliably.
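The "your code executes the call" half of that pattern can be sketched offline. Everything named here is hypothetical: `get_book_price` is an invented tool, `TOOLS` is a plain dictionary registry, and the way a request arrives (a function name plus a JSON string of arguments) is a simplified assumption rather than any provider's exact response format.

```python
import json


def get_book_price(title):
    """Hypothetical tool: look up a price in a tiny in-memory catalog."""
    catalog = {"Dune": 9.99, "Emma": 6.50}
    return {"title": title, "price": catalog.get(title)}


# Registry mapping tool names the model may request to Python callables.
TOOLS = {"get_book_price": get_book_price}


def run_tool_call(name, arguments_json):
    """Execute one model-requested call and return the result as a JSON string."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments_json)
        return json.dumps(TOOLS[name](**kwargs))
    except (TypeError, ValueError) as exc:
        # Bad JSON or wrong argument names: report it back instead of crashing.
        return json.dumps({"error": str(exc)})


# Pretend the model asked for this call.
result = run_tool_call("get_book_price", '{"title": "Dune"}')
print(result)  # {"title": "Dune", "price": 9.99}
```

Returning errors as data rather than raising lets the result be fed back to the model as a new message, so it can correct its own request.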
Grounding the bot with your own documents (RAG)
A model trained on public data will make things up when asked about your company, codebase, or documents. The fix is retrieval-augmented generation: store your documents as vector embeddings, retrieve the relevant chunks at query time, and inject them into the system prompt. This is the highest-value upgrade you can make after a chatbot works end to end.
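The retrieve-then-inject flow can be shown with a toy example. Note the loud assumption: word counts and cosine similarity stand in for real vector embeddings here, so the sketch runs with the standard library alone. The documents and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Toy 'embedding': lowercase word counts. Real systems use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = vectorize(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, vectorize(doc)), reverse=True)
    return ranked[:k]


def build_grounded_prompt(base_prompt, chunks):
    """Append the retrieved chunks to the system prompt."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"{base_prompt}\n\nUse only these notes when answering:\n{context}"


documents = [
    "Orders ship within two business days.",
    "Returns are accepted within 30 days of purchase.",
    "The shop is closed on public holidays.",
]
top = retrieve("How many days do I have for returns?", documents)
print(top[0])  # Returns are accepted within 30 days of purchase.
```

The string from `build_grounded_prompt` would take the place of the plain system prompt in the request; the rest of the chat loop is unchanged.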
FAQ
How much will it cost to run my chatbot with the OpenAI API?
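For light personal use, typically a few cents per day. gpt-4o-mini has been listed at $0.15 per million input tokens (output tokens are priced higher, and prices change, so check the current pricing page). At that rate, 2,000 input tokens cost about $0.0003. A 10-turn conversation bills more than its raw length suggests, because the whole history is re-sent as input on every turn, but it still typically costs well under a cent. Set a monthly spending cap in the OpenAI dashboard to avoid surprise bills.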
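A rough estimate of how re-sending the history affects the bill can be worked out in a few lines. This is a back-of-the-envelope sketch: `conversation_cost` is a hypothetical helper, every message is assumed to be the same size, and the $0.60 output price is an assumption to be checked against the provider's pricing page (only the $0.15 input figure comes from the text above).

```python
def conversation_cost(turns, tokens_per_message, input_price_per_m, output_price_per_m,
                      system_tokens=0):
    """Estimate the billed cost of a chat where the full history is re-sent each turn."""
    input_tokens = 0
    output_tokens = 0
    history_tokens = 0
    for _ in range(turns):
        history_tokens += tokens_per_message            # new user message
        input_tokens += system_tokens + history_tokens  # whole thread is billed as input
        output_tokens += tokens_per_message             # the reply
        history_tokens += tokens_per_message            # reply joins the history
    cost = (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000
    return input_tokens, output_tokens, cost


# 10 turns at ~100 tokens per message. Prices are dollars per million tokens;
# 0.60 for output is an ASSUMED figure for illustration.
input_tokens, output_tokens, cost = conversation_cost(10, 100, 0.15, 0.60)
print(input_tokens, output_tokens)  # 10000 1000
print(round(cost, 4))               # 0.0021
```

The conversation itself contains only 2,000 tokens of text, yet about 10,000 input tokens are billed, because each turn re-sends everything before it. That growth is what the sliding-window trim keeps in check.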
Why does my chatbot forget the conversation every time I restart it?
The history list lives in memory and is lost when the script stops. To persist conversations across restarts, save each message to a file or a local SQLite database and reload it on startup.
What is the difference between the system prompt and the user message?
The system prompt is a persistent, behind-the-scenes instruction prepended to every request — it sets the bot's persona and rules, and the user never sees it. User messages are what the human types. See the guide on system prompts for the full breakdown.
Can I use this same approach with Anthropic Claude or Google Gemini?
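Yes. The major providers expose the same core idea: a list of role-tagged messages sent with every request. Anthropic's SDK puts the system prompt in a separate system parameter instead of a role inside the array, and the reply is read from a different field of the response. Google's Gemini API uses its own field names for the same structure. The history loop itself stays the same: you swap the client, the model name, and the lines that build the request and read the reply.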
Why does streaming require a different code pattern than a normal API call?
Without streaming, the API returns a single response object when the full reply is ready. With stream=True, it returns an iterator of chunks — each chunk has a delta.content field that may hold one or a few tokens. You loop over the chunks, accumulate them into a string, and print or display each one as it arrives. At the end of the loop you have the full reply to append to history.
What happens if the conversation history gets too long?
The API will return a context_length_exceeded error if the total tokens exceed the model's context window. The simplest fix is to keep only the last 10-20 message pairs in the array you send. For production chatbots, consider summarizing older turns rather than discarding them, so the model retains important context from earlier in the thread.