AI/TLDR

How to Estimate the Cost of an AI App Before You Build It

Build a back-of-the-envelope model that predicts your AI app's monthly bill before you write production code.

Intermediate · 11 min read · Updated 2026-06-13

In plain English

An AI app that calls a language model doesn't run for free. Every question a user asks, and every answer the model writes, is metered and billed by the token, a chunk of text roughly three-quarters of a word long. Send 1,000 tokens in and get 500 back, and you pay for all 1,500, at rates the provider sets per million tokens (usually a higher rate for output than for input). Multiply that by every message, every user, every day, and you get a monthly bill. Cost estimation is the back-of-the-envelope math that predicts that bill before you build the thing.

[Illustration: Estimating AI App Cost (source: flyaps.com)]

Think of it like estimating a road trip before you leave. You don't need to drive the route to know the fuel cost — you take the distance, divide by your car's miles-per-gallon, and multiply by the price at the pump. AI cost estimation is the same shape: take the tokens per request, multiply by requests per user, multiply by your number of users, and multiply by the price per token. None of it requires running production code. It requires a few honest assumptions and a spreadsheet.

Why it matters

The number-one way AI products die quietly is a unit economics surprise: the feature works beautifully in the demo, ships to real users, and three weeks later someone notices the model bill is larger than the revenue. A five-minute estimate up front catches that before a single line of production code is written.

  • Pricing the feature. If your AI assistant costs $0.04 per active user per month, a $9/month subscription is comfortable. If it costs $14 per user, the same subscription loses money on every customer. You can only know which world you're in by estimating.
  • Choosing a model. The cheapest and most expensive models a provider offers can differ by 5–25× per token. A good estimate tells you whether you need the flagship model or whether a smaller, cheaper one keeps you profitable — see the modern AI app stack for where the model fits.
  • Killing bad ideas cheaply. Some features are simply not economically viable at the price users will pay. Discovering that in a spreadsheet costs you an afternoon. Discovering it after launch costs you a quarter.
  • Setting limits. Once you know the per-request cost, you can decide where to cap usage — a free tier, a daily message limit, a maximum response length — so a single power user (or a runaway loop) can't run up a four-figure bill overnight.
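The last bullet can be turned into arithmetic. Here is a minimal sketch, assuming the roughly 1.1 cents per turn worked out later in this article and a hypothetical budget of $1.00 of model spend per free user per month. The function name and both figures are illustrative, not recommendations.

```python
import math

def max_requests_within_budget(budget_usd: float, cost_per_request_usd: float) -> int:
    """Largest whole number of requests whose total cost stays within the budget."""
    if cost_per_request_usd <= 0:
        raise ValueError("cost per request must be positive")
    return math.floor(budget_usd / cost_per_request_usd)

cost_per_request = 0.0108  # assumed: one chatbot turn on a mid-tier model
free_tier_budget = 1.00    # hypothetical: model spend you accept per free user per month

monthly_cap = max_requests_within_budget(free_tier_budget, cost_per_request)
daily_cap = monthly_cap // 30  # spread evenly over a 30-day month
print(f"monthly cap: {monthly_cap} messages (about {daily_cap} per day)")
```

Working backwards from a budget like this gives you a defensible cap instead of a number picked by feel.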

Crucially, this is a skill you use before you have data. Once the app is live you can measure real usage and stop guessing. The estimate is the bridge that gets you from "I have an idea" to "I know roughly what it will cost to run" without building first.

How it works

Every cost estimate is the same chain of multiplications. You turn product assumptions into a per-request token count, scale that to a per-user cost, then scale to your whole user base. Get each link right and the final number falls out.

Step 1 — Tokens per request (and the input/output split)

A single model call has two token counts, and they are priced differently. Input tokens are everything you send: the system prompt, the conversation history, any retrieved documents, and the user's new message. Output tokens are what the model writes back. Output almost always costs more per token — often 4–5× more — so the two must be estimated separately.

A rough rule of thumb: 1 token ≈ 0.75 words, so 1,000 tokens is about 750 words, and 100 words is about 130 tokens. That's good enough for planning. (When you later want exact counts, use the token-counting endpoint your provider offers. Don't use a different model's tokenizer; its counts will be wrong.) For a single chatbot turn, a realistic starting estimate might be:

Part of the request                   | Counts as | Example size
--------------------------------------|-----------|-------------
System prompt (instructions)          | Input     | 300 tokens
Conversation history (last few turns) | Input     | 1,200 tokens
User's new message                    | Input     | 100 tokens
Model's reply                         | Output    | 400 tokens
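If you think in words rather than tokens, the rule of thumb converts directly. This is a rough sketch only: the word counts are hypothetical, chosen so they land on the example sizes above, and the 0.75 ratio is a planning approximation for English prose, not an exact tokenizer.

```python
WORDS_PER_TOKEN = 0.75  # planning rule of thumb for English prose

def words_to_tokens(words: int) -> int:
    """Approximate token count for a given number of English words."""
    return round(words / WORDS_PER_TOKEN)

# Hypothetical word counts for each part of one chatbot turn
input_parts = {"system_prompt": 225, "history": 900, "user_message": 75}
reply_words = 300

input_tokens = sum(words_to_tokens(w) for w in input_parts.values())
output_tokens = words_to_tokens(reply_words)
print(f"input: {input_tokens} tokens, output: {output_tokens} tokens")
```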

Step 2 — Cost of one request

Prices are quoted per million tokens (often written per MTok), with separate input and output rates. So the cost of one request is:

the per-request formula (text)
cost = (input_tokens  / 1,000,000) × input_price
     + (output_tokens / 1,000,000) × output_price

Take the example above — 1,600 input tokens, 400 output tokens — and a mid-tier model priced at $3 per MTok input and $15 per MTok output (a realistic 2026 rate for a balanced model):

worked example: one chatbot turn (text)
input:  1,600 / 1,000,000 × $3  = $0.0048
output:   400 / 1,000,000 × $15 = $0.0060
                              total = $0.0108  ≈ 1.1 cents per turn

Step 3 — Scale to a user, then to everyone

Now you need requests per active user per month. Estimate it from the product: a casual user might send 20 messages a month; a daily-driver power user might send 600. Pick a realistic average — say 50 messages per user per month. Then:

scaling up (text)
per user / month  = $0.0108 × 50            = $0.54
for 1,000 users   = $0.54  × 1,000         = $540 / month
for 10,000 users  = $0.54  × 10,000        = $5,400 / month

That single chain — tokens → request cost → user cost → fleet cost — is the entire estimate. Everything else in this article is about making each number more honest, or making the final number smaller.
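The "realistic average" in Step 3 is easier to defend when it comes from an explicit user mix. The sketch below assumes three hypothetical segments (the shares and the "regular" tier are invented for illustration) and computes a weighted average, plus how much of the bill the power users account for.

```python
COST_PER_REQUEST = 0.0108  # $ per chatbot turn, from the worked example

# Hypothetical user mix: (share of users in percent, requests per month)
segments = {
    "casual":  (80, 20),
    "regular": (15, 100),
    "power":   (5, 600),
}

total_share = sum(share for share, _ in segments.values())
assert total_share == 100, "segment shares must add up to 100%"

# Weighted average of requests per user across the mix
avg_requests = sum(share * reqs for share, reqs in segments.values()) / 100
cost_per_user = avg_requests * COST_PER_REQUEST

# Fraction of all requests (and so of all cost) that comes from power users
power_share, power_reqs = segments["power"]
power_fraction = (power_share * power_reqs) / (avg_requests * 100)

print(f"average requests per user : {avg_requests:.0f}")
print(f"cost per user per month   : ${cost_per_user:.2f}")
print(f"power users' share of bill: {power_fraction:.0%}")
```

With this particular mix, 5% of users generate roughly half of the requests, which is why the average alone can be misleading.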

Building the spreadsheet

The estimate lives best in a spreadsheet, because the whole point is to change one assumption and watch the bill move. Set it up as a list of named inputs at the top and a few computed rows below. Here's the same logic as a tiny script you could paste into a notebook — but a spreadsheet works identically and lets non-engineers play with it.

cost_model.py: a back-of-the-envelope projector (Python)
# --- Assumptions (the only numbers you edit) ---
input_tokens   = 1_600     # system + history + user message
output_tokens  = 400       # model's reply
input_price    = 3.00      # $ per million input tokens
output_price   = 15.00     # $ per million output tokens
reqs_per_user  = 50        # requests per active user per month
active_users   = 10_000

# --- Derived numbers (never edit these) ---
cost_per_req = (
    (input_tokens / 1_000_000) * input_price
    + (output_tokens / 1_000_000) * output_price
)
cost_per_user  = cost_per_req * reqs_per_user
monthly_bill   = cost_per_user * active_users

print(f"per request : ${cost_per_req:.4f}")
print(f"per user/mo : ${cost_per_user:.2f}")
print(f"monthly bill: ${monthly_bill:,.0f}")

# per request : $0.0108
# per user/mo : $0.54
# monthly bill: $5,400

The real value is the viability flag. Add a short check that compares the per-user cost to what a user pays you. If a user pays $9/month and costs you $0.54 to serve, your model spend is 6% of revenue, which is healthy. If they cost $7.50 to serve, it's 83%: dead on arrival before you've paid for anything else. The check below flags anything at or above 30% of revenue, an illustrative threshold you should set to match your own margins.

the viability check (Python)
price_to_user = 9.00
model_share = cost_per_user / price_to_user
verdict = "OK" if model_share < 0.30 else "RECONSIDER"
print(f"model spend is {model_share:.0%} of revenue -> {verdict}")
# model spend is 6% of revenue -> OK

Levers that change the number

Once the spreadsheet exists, you can test the moves that make the bill smaller. The two biggest are choosing a cheaper model and caching repeated input — both can cut the number several-fold without changing what the user sees.

Model choice

Providers offer a tiered lineup, and the spread is large. As an illustration, a 2026 lineup might price a small model at $1 / $5 per MTok (input / output), a balanced model at $3 / $15, and a flagship at $5 / $25. Running the same 1,600-in / 400-out request through each:

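Model tier   | Input $/MTok | Output $/MTok | Cost / request | 10k users × 50 req
-------------|--------------|---------------|----------------|-------------------
Small / fast | $1           | $5            | $0.0036        | $1,800 / mo
Balanced     | $3           | $15           | $0.0108        | $5,400 / mo
Flagship     | $5           | $25           | $0.0180        | $9,000 / mo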

The small model is 5× cheaper than the flagship for the same traffic. The right question is never "which model is best?" but "which is the cheapest model that's good enough for this task?" Many features — classification, short replies, routing — run perfectly on the small tier. Reserve the flagship for the requests that genuinely need it.
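The tier comparison is a natural thing to script, so you can swap in your own provider's prices. This sketch uses the illustrative prices from the table above; the tier names and figures are not any specific provider's price list.

```python
# Illustrative tier prices in $ per million tokens: (input, output)
TIERS = {
    "small":    (1.00, 5.00),
    "balanced": (3.00, 15.00),
    "flagship": (5.00, 25.00),
}
INPUT_TOKENS, OUTPUT_TOKENS = 1_600, 400
REQUESTS_PER_MONTH = 10_000 * 50  # users x requests per user

def request_cost(input_price: float, output_price: float) -> float:
    """Cost in dollars of one request at the given per-MTok prices."""
    return (INPUT_TOKENS * input_price + OUTPUT_TOKENS * output_price) / 1_000_000

monthly = {tier: request_cost(*prices) * REQUESTS_PER_MONTH
           for tier, prices in TIERS.items()}

for tier, bill in monthly.items():
    per_request = request_cost(*TIERS[tier])
    print(f"{tier:<9} ${per_request:.4f} per request   ${bill:,.0f} per month")
```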

Prompt caching

If a large chunk of your input is the same on every request (a long system prompt, a fixed set of instructions, a reference document), most providers let you cache that prefix. A cache read typically costs around one-tenth of the normal input price. When 1,300 of your 1,600 input tokens are a stable prefix, caching them cuts the input cost of that request by roughly 70% (from $0.0048 to about $0.0013 at the $3 per MTok rate), leaving output as the dominant cost.
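To see what caching does to the running example, here is a small sketch. It assumes cache reads are billed at one-tenth of the input price, as described above, and it deliberately ignores any surcharge for writing to the cache, which some providers apply; check your provider's pricing before relying on it.

```python
INPUT_PRICE = 3.00        # $ per million input tokens
OUTPUT_PRICE = 15.00      # $ per million output tokens
CACHE_READ_FACTOR = 0.10  # assumed: cache reads billed at one-tenth of the input price

cached_prefix_tokens = 1_300  # stable prefix served from the cache
fresh_input_tokens = 300      # the part of the input that changes per request
output_tokens = 400

def dollars(tokens: int, price_per_mtok: float) -> float:
    """Cost in dollars of a token count at a per-million-token price."""
    return tokens / 1_000_000 * price_per_mtok

input_uncached = dollars(cached_prefix_tokens + fresh_input_tokens, INPUT_PRICE)
input_cached = (dollars(cached_prefix_tokens, INPUT_PRICE * CACHE_READ_FACTOR)
                + dollars(fresh_input_tokens, INPUT_PRICE))
output_cost = dollars(output_tokens, OUTPUT_PRICE)  # caching does not change this

print(f"input without cache: ${input_uncached:.5f}")
print(f"input with cache   : ${input_cached:.5f}")
print(f"whole request      : ${input_uncached + output_cost:.5f}"
      f" -> ${input_cached + output_cost:.5f}")
```

Note that the output cost is untouched, so the saving on the whole request is smaller than the saving on input alone.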

Going deeper

The simple chain handles a single-call chatbot well. Real systems have a few wrinkles that, if ignored, make an estimate quietly wrong. Here are the ones worth knowing once the basics click.

Agents and multi-step loops multiply everything. A plain chatbot makes one model call per user message. An AI agent that plans, calls tools, reads results, and tries again might make 5–15 calls for a single user request — and each step resends the growing transcript as input. If your feature is agentic, estimate the average number of model calls per task and multiply your per-call cost by it. This is the most common reason a real bill blows past a naive estimate.
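Multiplying the per-call cost by the number of calls gives a floor, because it ignores the growing transcript. The sketch below models that growth under stated assumptions: 8 calls per task, 400 output tokens per call, and a hypothetical 500-token tool result appended after each call. All three numbers are illustrative.

```python
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00  # $ per million tokens

def agent_task_cost(calls: int, base_input: int, output_per_call: int,
                    tool_result_tokens: int) -> float:
    """Cost of one agent task where every call resends the whole transcript so far."""
    total = 0.0
    transcript = base_input
    for _ in range(calls):
        total += (transcript * INPUT_PRICE + output_per_call * OUTPUT_PRICE) / 1_000_000
        # The model's output and the tool result both join the transcript for the next call
        transcript += output_per_call + tool_result_tokens
    return total

# Hypothetical task: 8 calls, each tool result adds 500 tokens
loop_cost = agent_task_cost(calls=8, base_input=1_600,
                            output_per_call=400, tool_result_tokens=500)
naive_cost = 8 * 0.0108  # per-call cost times number of calls, ignoring growth

print(f"naive estimate : ${naive_cost:.4f}")
print(f"with growth    : ${loop_cost:.4f} ({loop_cost / naive_cost:.2f}x the naive figure)")
```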

RAG inflates input. If you retrieve documents and paste them into the prompt, those documents are input tokens you pay for on every query. Stuffing five 500-token passages into context adds 2,500 input tokens to each request — often dwarfing the user's actual question. Budget retrieval as part of your input count, not an afterthought.
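Plugging the retrieval example into the per-request formula shows the effect. This sketch reuses the $3 / $15 per MTok prices and the 1,600-in / 400-out request from earlier; the passage count and size come from the paragraph above.

```python
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00  # $ per million tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one request at the prices above."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

base_input, output_tokens = 1_600, 400
passages, tokens_per_passage = 5, 500
retrieved_tokens = passages * tokens_per_passage  # extra input on every query

without_rag = request_cost(base_input, output_tokens)
with_rag = request_cost(base_input + retrieved_tokens, output_tokens)

print(f"without retrieval: ${without_rag:.4f}")
print(f"with retrieval   : ${with_rag:.4f} ({with_rag / without_rag - 1:.0%} more)")
```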

Tokens are not words, and not all text tokenizes equally. The 0.75-words-per-token rule is fine for English prose, but code, JSON, non-English languages, and unusual formatting can tokenize very differently — sometimes far more tokens per character. For a planning estimate the rule is good enough; before launch, run a representative sample of your real prompts through the provider's token-counting endpoint to calibrate.

Two cost types you might forget. Embeddings (turning text into vectors for search) are billed separately and usually cheaply, but at high volume they add up. And if you do background or non-urgent work, many providers offer a batch mode at roughly half price for jobs that can wait — worth a column in your spreadsheet if any of your workload is asynchronous.
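Both items fit in a couple of extra spreadsheet rows. In the sketch below every input is an assumption made for illustration: the 40% asynchronous share, the monthly embedding volume, and especially the embedding price, which is a placeholder rather than any provider's actual rate.

```python
monthly_bill = 5_400.00  # $ per month at standard rates, from the scaling example
async_share = 0.40       # hypothetical: fraction of the workload that can wait
BATCH_DISCOUNT = 0.50    # assumed: batch jobs billed at roughly half price

# Only the asynchronous share of the workload gets the batch discount
batch_adjusted = monthly_bill * (1 - async_share * BATCH_DISCOUNT)

embedding_tokens = 500_000_000  # hypothetical: tokens embedded per month
EMBEDDING_PRICE = 0.02          # placeholder $ per million tokens; check your provider
embedding_cost = embedding_tokens / 1_000_000 * EMBEDDING_PRICE

total = batch_adjusted + embedding_cost
print(f"chat after batch discount: ${batch_adjusted:,.0f}")
print(f"embeddings               : ${embedding_cost:,.0f}")
print(f"total                    : ${total:,.0f}")
```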

The honest limits. An estimate is only as good as its assumptions, and the assumption you'll get most wrong is requests per user. Real usage is almost always more skewed than you expect, with a few power users driving most of the cost. So add a high-estimate column to the spreadsheet next to your expected case, watch real usage the moment you launch, and treat the spreadsheet as a living model you update with measured numbers rather than a one-time prophecy. The estimate's job isn't to be exactly right; it's to be right enough to make the build-or-don't decision with your eyes open. From here, the natural next steps are choosing where the app runs (deployment options) and designing around the other cost-shaped constraint, latency.
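One way to keep yourself honest about requests per user is to run low, expected, and high cases side by side. The sketch below does that with hypothetical request counts for the low and high cases and the same 30% threshold used in the viability check earlier.

```python
COST_PER_REQUEST = 0.0108  # $ per chatbot turn
ACTIVE_USERS = 10_000
PRICE_TO_USER = 9.00       # $ per user per month
MAX_MODEL_SHARE = 0.30     # same threshold as the viability check

# Hypothetical requests per active user per month for each case
scenarios = {"low": 20, "expected": 50, "high": 300}

results = {}
for name, reqs in scenarios.items():
    cost_per_user = COST_PER_REQUEST * reqs
    share = cost_per_user / PRICE_TO_USER
    verdict = "OK" if share < MAX_MODEL_SHARE else "RECONSIDER"
    bill = cost_per_user * ACTIVE_USERS
    results[name] = (bill, share, verdict)
    print(f"{name:<8} ${bill:>8,.0f} / month   {share:.0%} of revenue   {verdict}")
```

If the high case flips the verdict, as it does with these numbers, that is a signal to plan a usage cap or a cheaper model tier before launch.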

FAQ

How do I estimate the cost of an AI app before building it?

Multiply four numbers: tokens per request (input plus output), requests per active user per month, the provider's price per token, and your number of active users. Input and output tokens are priced separately, so estimate them apart. Put the assumptions in a spreadsheet so you can change one and watch the projected monthly bill move.

Why are input and output tokens priced differently?

Generating text is more computationally expensive than reading it, so providers charge more for output — often 4–5× the input rate. A request that sends 1,600 input tokens and returns 400 output tokens can have most of its cost come from those 400 output tokens. Always estimate the two separately rather than using one blended rate.

How many tokens is a typical request?

Use the rule of thumb that 1 token is about 0.75 words (1,000 tokens ≈ 750 words). A single chatbot turn might be 1,500–2,000 input tokens (system prompt plus conversation history plus the user's message) and a few hundred output tokens. The conversation history is the part that grows over a session, so it dominates input cost in long chats.

How much can choosing a cheaper model save?

A lot. Within one provider's lineup, the smallest model can be 5× or more cheaper per token than the flagship. For the same traffic, that's the difference between, say, $1,800 and $9,000 a month. The right question is which is the cheapest model that's good enough for the task — many features run fine on a small, fast tier.

Does prompt caching really lower the bill?

Yes, when a large part of your input repeats across requests. Caching a stable prompt prefix (a long system prompt, fixed instructions, a reference document) lets the provider charge roughly one-tenth of the normal input price on cache reads. If most of your input is a fixed prefix, caching can turn the bulk of your input cost into a rounding error.

Why did my real AI bill come out higher than my estimate?

The three usual culprits are: conversation history resent on every turn (input grows each message), agent loops making many model calls per user request, and a few power users sending far more requests than the average you assumed. Build a high-estimate column for power users and multi-step workflows, and recalibrate against measured usage as soon as you launch.
