In plain English
Every call you make to a hosted LLM API has a price, and that price is measured in tokens — small chunks of text, roughly 3-4 characters each (about 0.75 words). You pay for the tokens you send in (the prompt) and the tokens the model writes back (the answer), at two different rates. That's it. The whole bill is just how many tokens × how much each one costs, summed over every request.

Estimating your cost before you build is like pricing a road trip before you leave. You don't need the exact figure to the cent — you need a back-of-envelope number: distance (tokens per request) × number of trips (requests per day) × fuel price (the rate card). Get those three right and you'll know within a reasonable margin whether your feature costs $5 a month or $5,000.
This page is a practical estimation guide, not a pricing explainer. For the rate cards and how providers structure their pricing, see LLM API pricing. Here, the goal is to hand you a formula you can reuse for any feature on any model, plus two fully worked examples.
Why it matters
The number one way LLM projects blow their budget is by skipping this step. A demo that costs pennies for one user looks free. The same demo serving 10,000 users a day, each having a 20-message conversation, can cost more than an engineer's salary — and nobody notices until the invoice arrives.
A five-minute estimate up front answers questions you must answer before writing code:
- Is this feature even viable? If the math says $40,000/month and your product makes $5,000/month, you need a cheaper model, caching, or a different design — and it's far better to learn that now than after launch.
- Which model can I afford? The same feature on a frontier model versus a small fast model can differ 3-10× in cost. The estimate tells you whether the quality gap is worth the price gap.
- Where should I optimize? The formula shows you exactly which lever matters — usually trimming output length or caching a big shared prompt — instead of guessing.
- What do I tell finance? "Roughly $0.012 per user per day, scaling linearly" is a sentence a budget owner can plan around. "I'm not sure, it depends" is not.
The estimate doesn't have to be perfect. It has to be roughly right and built before you commit. A number that's within 2× of reality, produced in advance, beats a precise number you discover after the bill lands.
How it works
The core formula has two halves: the cost of one request, then that cost scaled by traffic.
Step 1 — cost of a single request
Count the input tokens and the output tokens for a typical request, then multiply each by its rate. Rates are almost always quoted per million tokens (written /1M or /MTok), so divide your token counts by 1,000,000 first.
cost_per_request =
    (input_tokens / 1,000,000) × input_rate
  + (output_tokens / 1,000,000) × output_rate
Step 2 — scale by traffic
Multiply the per-request cost by how many requests you expect. Pick the time window that's easy to reason about (per day is usually clearest), then multiply up to a month.
daily_cost = cost_per_request × requests_per_day
monthly_cost = daily_cost × 30
That's the entire mechanism. Everything else (caching, batch discounts, picking a model) is just a multiplier you apply on top of these two steps. The hard part isn't the arithmetic; it's getting honest token counts and an honest request volume.
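The two steps translate directly into code. The sketch below is one minimal way to write them; the function names and the sample numbers (1,000 input tokens, 200 output tokens, 10,000 requests a day) are illustrative assumptions, not figures from any provider.

```python
def cost_per_request(input_tokens, output_tokens, input_rate, output_rate):
    """Dollar cost of one request. Rates are dollars per 1M tokens."""
    return (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate


def monthly_cost(per_request, requests_per_day, days=30):
    """Scale a per-request cost to a month of traffic."""
    return per_request * requests_per_day * days


# Illustrative inputs only: swap in your own token counts, rates, and volume.
one_request = cost_per_request(1_000, 200, input_rate=3.0, output_rate=15.0)
print(f"per request: ${one_request:.4f}")
print(f"per month:   ${monthly_cost(one_request, requests_per_day=10_000):,.2f}")
```

Keeping the rates as parameters makes it easy to rerun the same estimate against a different rate card.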
Two worked examples
Let's run the formula on two real shapes of feature. We'll use published Claude rates as the example rate card — Claude Sonnet at $3/1M input and $15/1M output, and Claude Haiku at $1/1M input and $5/1M output. Substitute your own provider's numbers; the method is identical.
Example A — a customer-support chatbot
Assume each user turn sends a system prompt plus recent history (about 1,500 input tokens) and the model replies with about 250 output tokens. A typical support conversation is 6 turns. You expect 2,000 conversations a day.
| Quantity | Value |
|---|---|
| Input tokens / turn | 1,500 |
| Output tokens / turn | 250 |
| Turns / conversation | 6 |
| Conversations / day | 2,000 |
| Requests / day | 12,000 |
Per-turn cost on Sonnet: (1,500 / 1M × $3) + (250 / 1M × $15) = $0.0045 + $0.00375 = $0.00825. Notice the 250 output tokens cost almost as much as the 1,500 input tokens — that's the output rate at work.
Scale it up: $0.00825 × 6 turns × 2,000 conversations = $99/day, or roughly $2,970/month. Run the same workload on Haiku — (1,500 / 1M × $1) + (250 / 1M × $5) = $0.0015 + $0.00125 = $0.00275/turn — and you land at $33/day, about $990/month. Same feature, one-third the cost, just by changing the model tier.
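To rerun Example A with your own numbers, a short script along these lines may help. It uses the example rate card and workload from the text; the dictionary layout and function names are assumptions made for this sketch.

```python
# Example rate card from the text, in dollars per 1M tokens.
RATES = {
    "sonnet": {"input": 3.0, "output": 15.0},
    "haiku": {"input": 1.0, "output": 5.0},
}


def turn_cost(model, input_tokens=1_500, output_tokens=250):
    """Cost of one chatbot turn on the given model tier."""
    rate = RATES[model]
    return input_tokens / 1e6 * rate["input"] + output_tokens / 1e6 * rate["output"]


def chatbot_daily_cost(model, turns=6, conversations=2_000):
    """Daily cost: per-turn cost x turns per conversation x conversations per day."""
    return turn_cost(model) * turns * conversations


for model in RATES:
    daily = chatbot_daily_cost(model)
    print(f"{model}: ${turn_cost(model):.5f}/turn, ${daily:,.2f}/day, ${daily * 30:,.2f}/month")
```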
Example B — a RAG question-answering app
A retrieval-augmented app stuffs retrieved document chunks into the prompt, so input tokens are large. Say 8,000 input tokens (instructions + 6 retrieved chunks + the question) and 600 output tokens per answer, with 5,000 questions a day and no multi-turn history.
| Quantity | Sonnet cost |
|---|---|
| Input: 8,000 tok | 8,000 / 1M × $3 = $0.024 |
| Output: 600 tok | 600 / 1M × $15 = $0.009 |
| Per question | $0.033 |
| 5,000 / day | $165/day |
| Per month (×30) | ≈ $4,950 |
Here the picture flips: input dominates because retrieval bloats the prompt. That tells you exactly where to optimize — retrieve fewer or shorter chunks, or cache the stable instruction block so you stop paying full price for it on every call. See prompt caching for how that works and what it saves.
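One way to see which side of the bill dominates is to compute the input share of each request. The sketch below does that for the two examples at the Sonnet example rates; `cost_breakdown` is a hypothetical helper written for this illustration.

```python
def cost_breakdown(input_tokens, output_tokens, input_rate, output_rate):
    """Split one request's cost into input and output parts (rates per 1M tokens)."""
    input_cost = input_tokens / 1e6 * input_rate
    output_cost = output_tokens / 1e6 * output_rate
    total = input_cost + output_cost
    return {
        "input": input_cost,
        "output": output_cost,
        "total": total,
        "input_share": input_cost / total,
    }


chat = cost_breakdown(1_500, 250, input_rate=3.0, output_rate=15.0)  # Example A turn
rag = cost_breakdown(8_000, 600, input_rate=3.0, output_rate=15.0)   # Example B question

for name, b in (("chatbot", chat), ("rag", rag)):
    print(f"{name}: total ${b['total']:.5f}, input share {b['input_share']:.0%}")
```

A high input share points at prompt size (retrieval, history, instructions) as the lever; a low one points at answer length.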
The levers: caching, batching, and model tier
Once you have a baseline number, three discounts can move it dramatically. Each is just a multiplier on the relevant part of the formula.
- Prompt caching
  - Discounts repeated input tokens
  - Cache reads cost ~10% of the input rate
  - Best when a big prompt prefix is shared across calls
  - Applies to the input half only
- Batch processing
  - ~50% off both input and output
  - For non-urgent, async work
  - Results within hours, not seconds
  - Applies to the whole request
- Model tier
  - Cheaper model = lower rates
  - Often 3-10× cheaper
  - Trades some quality for cost
  - Changes both rates at once
Caching as a multiplier
If 7,000 of your 8,000 RAG input tokens are a fixed instruction block that never changes, caching lets repeated calls read those tokens at roughly one-tenth the input rate. The cached portion of the input cost drops by about 90%. In Example B that turns $0.024 of input cost into roughly $0.0051 (7,000 cached tokens at $0.30/1M = $0.0021, plus 1,000 uncached tokens at $3/1M = $0.003), cutting the per-question cost from $0.033 to about $0.014, a reduction of more than half. (There's a small one-time write surcharge to populate the cache; ignore it for a rough estimate when calls reuse the prefix many times.)
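The caching multiplier only touches the cached part of the input, which is easy to get wrong by hand. The sketch below separates the three parts of a request. The 0.10 read multiplier is the rough figure from the text, the write surcharge is ignored, and the sample request (a 4,000-token shared prefix, 500 fresh tokens, 300 output tokens) is hypothetical.

```python
def request_cost_with_cache(cached_tokens, fresh_tokens, output_tokens,
                            input_rate, output_rate, cache_read_multiplier=0.10):
    """Per-request cost when `cached_tokens` of the prompt are read from cache.

    Rates are dollars per 1M tokens. The one-time cache-write surcharge is
    deliberately left out, which is reasonable only when the prefix is reused
    many times.
    """
    cached_cost = cached_tokens / 1e6 * input_rate * cache_read_multiplier
    fresh_cost = fresh_tokens / 1e6 * input_rate
    output_cost = output_tokens / 1e6 * output_rate
    return cached_cost + fresh_cost + output_cost


# Hypothetical request, priced with and without a cached prefix.
without_cache = request_cost_with_cache(0, 4_500, 300, input_rate=3.0, output_rate=15.0)
with_cache = request_cost_with_cache(4_000, 500, 300, input_rate=3.0, output_rate=15.0)
saving = 1 - with_cache / without_cache
print(f"without cache: ${without_cache:.4f}, with cache: ${with_cache:.4f}, saving {saving:.0%}")
```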
Batching as a multiplier
If the work doesn't need an instant answer — overnight summarization, bulk classification, generating embeddings for a backlog — a batch API typically charges about half the normal rate for both input and output. That's a flat 0.5× on the whole request cost. For latency-tolerant workloads it's the easiest large saving available.
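In practice only part of a workload can usually wait for a batch job, so the saving is a blend. The sketch below assumes the roughly 0.5× batch multiplier mentioned above; `blended_monthly_cost` and the 40% batchable share are illustrative assumptions.

```python
def blended_monthly_cost(baseline_monthly, batch_fraction, batch_multiplier=0.5):
    """Monthly cost when `batch_fraction` of the spend moves to a batch API.

    `batch_multiplier` is the assumed batch price relative to the normal rate.
    """
    if not 0.0 <= batch_fraction <= 1.0:
        raise ValueError("batch_fraction must be between 0 and 1")
    realtime = baseline_monthly * (1 - batch_fraction)
    batched = baseline_monthly * batch_fraction * batch_multiplier
    return realtime + batched


# Illustrative: a $4,950/month baseline where 40% of the work can run overnight.
print(f"${blended_monthly_cost(4_950, batch_fraction=0.4):,.2f}")
```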
Common pitfalls
Estimates go wrong in predictable ways. Watch for these.
- Forgetting output costs more. The single most common error. Output tokens are typically priced at 4-5× the input rate, so a chatty model that writes long answers can dominate your bill even with a short prompt. Always price output at its own (higher) rate.
- Ignoring conversation history. In a chat app, every turn re-sends the entire prior conversation as input. Turn 10 might send 10× the tokens of turn 1. If you price only a single turn, you'll undercount a long conversation by a wide margin.
- Counting words as tokens. Tokens are smaller than words (~0.75 words each), so a raw word count understates the token count: 1,000 words is roughly 1,333 tokens, about a third more than the word count. Code, JSON, and non-English text tokenize even less efficiently.
- Using the wrong tokenizer. Each provider tokenizes differently. A count from one provider's tool is not valid for another's model. Use the matching token-counting endpoint.
- Forgetting thinking/reasoning tokens. Models that "think" before answering bill those hidden reasoning tokens as output. If your model reasons, your effective output token count is higher than the visible answer suggests.
- Estimating at average, ignoring the tail. A few power users or a runaway agent loop can generate far more requests than the average. Size for a realistic peak, not just the mean, or set hard limits.
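The conversation-history pitfall is the easiest one to quantify. The sketch below totals the input tokens billed across a chat when every turn re-sends the system prompt and all prior messages. The token sizes are made-up illustrative values, and the model assumes no history truncation or caching.

```python
def conversation_input_tokens(turns, system_tokens, user_tokens, reply_tokens):
    """Total input tokens billed over a conversation with full history re-sent each turn."""
    total = 0
    history = 0  # tokens of all earlier user messages and model replies
    for _ in range(turns):
        total += system_tokens + history + user_tokens
        history += user_tokens + reply_tokens
    return total


# Illustrative sizes: 500-token system prompt, 50-token messages, 250-token replies.
actual = conversation_input_tokens(turns=10, system_tokens=500, user_tokens=50, reply_tokens=250)
naive = 10 * (500 + 50)  # pricing every turn as if it were turn 1
print(f"actual: {actual:,} tokens, naive: {naive:,} tokens, ratio {actual / naive:.2f}x")
```

Input grows roughly with the square of the turn count, which is why long conversations are usually capped or summarized.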
Going deeper
The two-step formula gets you a solid baseline. A few refinements matter once you move from estimate to production.
Read usage from the response, then build a real model. Every API response reports the exact input_tokens and output_tokens it billed (and separate counts for cached reads and cache writes when caching is on). Log these from day one. After a week of real traffic you can replace every assumption in your estimate with a measured distribution — average prompt size, average answer length, requests per active user — and your forecast stops being a guess.
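Once usage is logged, turning it into a measured per-request cost takes a few lines. The sketch below assumes each log record is a plain dict carrying the input_tokens and output_tokens fields; the record layout and sample values are hypothetical, and cached-token counts are left out for brevity.

```python
from statistics import mean


def summarize_usage(records, input_rate, output_rate):
    """Average token counts and average cost per request from logged usage.

    `records` is a list of dicts with "input_tokens" and "output_tokens";
    rates are dollars per 1M tokens.
    """
    avg_in = mean(r["input_tokens"] for r in records)
    avg_out = mean(r["output_tokens"] for r in records)
    avg_cost = avg_in / 1e6 * input_rate + avg_out / 1e6 * output_rate
    return {
        "avg_input_tokens": avg_in,
        "avg_output_tokens": avg_out,
        "avg_cost_per_request": avg_cost,
    }


# Hypothetical log entries standing in for a week of real traffic.
usage_log = [
    {"input_tokens": 1_400, "output_tokens": 300},
    {"input_tokens": 1_600, "output_tokens": 200},
    {"input_tokens": 2_100, "output_tokens": 400},
]
summary = summarize_usage(usage_log, input_rate=3.0, output_rate=15.0)
print(summary)
```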
Cost per user, not just cost per request. Finance thinks in users, not tokens. Divide your monthly cost by monthly active users to get a per-user figure, then compare it to what each user pays you. If a user costs $0.40/month in tokens and pays $9/month, you have healthy margin; if they cost $12, you have a problem the estimate just caught for you. The commonly quoted figure, LLM cost per 1,000 users, is just cost_per_user × 1,000.
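The per-user arithmetic can be sketched as follows. The $2,970 monthly cost is Example A's Sonnet figure; the 7,500 monthly active users and the $9 price are assumptions chosen for illustration.

```python
def cost_per_user(monthly_cost, monthly_active_users):
    """Average monthly token cost attributable to one active user."""
    return monthly_cost / monthly_active_users


def gross_margin(price_per_user, cost_per_user_value):
    """Fraction of the per-user price left after token costs."""
    return (price_per_user - cost_per_user_value) / price_per_user


per_user = cost_per_user(monthly_cost=2_970, monthly_active_users=7_500)
print(f"cost per user:        ${per_user:.3f}/month")
print(f"cost per 1,000 users: ${per_user * 1_000:,.2f}/month")
print(f"margin at $9/month:   {gross_margin(9.0, per_user):.1%}")
```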
Mind the variable, hard-to-predict pieces. Agentic systems that loop and call tools many times per user action can multiply request counts in ways a single-request estimate misses — price the expected number of model calls per task, not per user message. Long-context features get more expensive as the context fills. And if you let users paste arbitrary text, your input size is unbounded unless you cap it.
Choosing the model is itself a cost decision. The cheapest model that meets your quality bar is usually the right one, and the only way to know the bar is to test candidates on your actual task. See how to choose an LLM model for that side of the tradeoff — the cost estimate and the quality test together tell you which model to ship.
Put a ceiling on it. An estimate predicts the expected bill; it doesn't prevent a surprise one. Set a spending limit or budget alert in your provider's console, cap max_tokens so no single response runs away, and rate-limit per user. The estimate tells you where to set those limits — roughly your predicted peak plus a margin.
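Caps make the worst case computable: if input size, output length, and requests per user are all bounded, so is the daily bill. The sketch below multiplies those bounds out. All limits shown are hypothetical, and `worst_case_daily_cost` is a helper invented for this example rather than any provider feature.

```python
def worst_case_daily_cost(users, max_requests_per_user, max_input_tokens,
                          max_output_tokens, input_rate, output_rate):
    """Upper bound on daily spend when every user hits every limit.

    Rates are dollars per 1M tokens; `max_output_tokens` corresponds to the
    max_tokens cap on each response.
    """
    per_request_ceiling = (max_input_tokens / 1e6 * input_rate
                           + max_output_tokens / 1e6 * output_rate)
    return users * max_requests_per_user * per_request_ceiling


# Hypothetical limits: 1,000 users, 50 requests each, 4,000-token prompts, 500-token replies.
ceiling = worst_case_daily_cost(
    users=1_000, max_requests_per_user=50,
    max_input_tokens=4_000, max_output_tokens=500,
    input_rate=3.0, output_rate=15.0,
)
print(f"worst-case daily spend: ${ceiling:,.2f}")
```

Comparing this ceiling with the expected daily cost shows how much headroom a budget alert needs.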
FAQ
How do I calculate the cost of a single LLM API request?
Take your input tokens divided by 1,000,000, times the input rate, plus your output tokens divided by 1,000,000, times the output rate. Add the two. For example, 1,500 input and 250 output tokens at $3/1M input and $15/1M output costs (1,500/1M × $3) + (250/1M × $15) = $0.0045 + $0.00375 = about $0.008 per request.
Why is output more expensive than input for LLM APIs?
Generating tokens is more computationally expensive than reading them, so providers charge output at a higher rate — often 4-5× the input rate. This means a model that writes long answers can dominate your bill even with a short prompt. Always price output tokens at their own higher rate, never at the input rate.
How many tokens is a typical request?
It depends entirely on your feature. A rough rule is ~0.75 words per token (1,000 words ≈ 1,333 tokens), but that drifts on code, JSON, and non-English text. For a real number, run a representative prompt through your provider's token-counting endpoint rather than guessing from a word count: the token count is about a third higher than the word count, so pricing by words understates the bill.
How do I estimate LLM cost per 1,000 users?
First find your cost per active user: divide your total monthly token cost by your monthly active users. Then multiply by 1,000. So if each user costs $0.012/month in tokens, 1,000 users cost about $12/month. Build the per-user figure from real usage data (the API reports token counts on every response) as soon as you have traffic.
Does prompt caching change my cost estimate?
Yes, significantly, when a large part of your prompt is repeated across calls. Cached input tokens are read at roughly one-tenth of the normal input rate, so a big fixed instruction block becomes about 90% cheaper on repeat calls. It only affects the input side and only helps when the same prefix is reused many times. There's a small one-time write surcharge you can ignore for a rough estimate.
How can I make my estimate more accurate?
Run the smallest real test you can: send ~100 representative requests and read the actual input and output token counts the API returns on each response. Compare to your prediction. If you're off by more than about 2×, find out why — usually it's forgotten conversation history, longer-than-expected output, or hidden reasoning tokens — before you scale up.