In plain English
"Fine-tuning is expensive" and "fine-tuning is cheap" are both correct — they just describe different points on a very wide spectrum. A weekend experiment using LoRA on a rented GPU can cost less than a cup of coffee. A full fine-tune of a 70B parameter model with human-annotated data and production serving easily crosses six figures. The number that matters is your number, and it depends on four variables: which method you use, how big the base model is, where you run the compute, and whether you factor in everything beyond the training job itself.

Think of it like renovating a house. The materials (compute) are just one line on the invoice. Labour (engineering time), design (dataset creation), and ongoing running costs (inference serving) often dwarf the raw materials cost — and they're the ones that surprise people. This article breaks every line item apart so you can estimate the real number before you commit.
Why the cost question matters
Fine-tuning is a capital decision. Unlike a prompt change — which you can roll back in seconds at zero cost — a fine-tuning run consumes time, money, and engineering attention up front and produces an artifact that will need maintenance. Getting the budget estimate wrong in either direction is costly: underestimate and you run out of runway mid-project; overestimate and you kill a project that would have been cheap to prove out.
The cost question also determines which fine-tuning approach makes sense. If your budget is under $100, full fine-tuning of a 7B model or anything larger is off the table, but LoRA or a hosted fine-tuning API is well within reach. If your budget is $10,000+, you unlock larger models, more epochs, and the repeated runs needed to tune hyperparameters properly. Knowing the cost landscape first lets you choose the method that fits your budget rather than discovering the mismatch after the fact.
- Choose the right method. LoRA/QLoRA, full fine-tuning, and hosted APIs can differ in cost by 10–100× for the same model size.
- Avoid surprise bills. Serving a fine-tuned model on a dedicated GPU endpoint can cost more per month than the training run that produced it.
- Plan for iteration. Real fine-tuning projects run 3–10 experiments before finding a winning checkpoint. Each run adds to the total.
- Compare to alternatives. Sometimes RAG, better prompting, or a different base model is cheaper than fine-tuning — but you can only see that with real numbers in hand.
How the cost is built up
A complete fine-tuning budget has four layers. The compute layer gets most of the attention, but the other three are where projects actually go over budget.
Layer 1: Compute
GPU time is priced by the hour and varies enormously by hardware, cloud provider, and whether you're willing to use spot/interruptible instances. The three hardware tiers relevant to fine-tuning are the RTX 4090 (consumer, 24 GB VRAM — good for small models with QLoRA), the A100 80GB (data-centre, the workhorse for 7B-13B full fine-tunes and 70B LoRA), and the H100 80GB (fastest training, best for anything above 30B or tight deadlines).
| GPU | VRAM | RunPod (spot) | Lambda Labs | Vast.ai (low) |
|---|---|---|---|---|
| RTX 4090 | 24 GB | ~$0.34/hr | N/A | ~$0.16/hr |
| A100 80GB | 80 GB | ~$1.30/hr | ~$1.29/hr | ~$0.67/hr |
| H100 SXM | 80 GB | ~$2.69/hr | ~$4.29/hr | ~$1.87/hr |
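
As a rough sketch of how these hourly rates turn into a bill, the calculation is rate × hours × number of GPUs. The rates below are copied from the table above and will drift over time; `HOURLY_RATES`, `compute_cost`, and the provider keys are illustrative names, not any provider's API.

```python
# Hourly rates in USD, copied from the table above (they change frequently).
HOURLY_RATES = {
    "RTX 4090": {"runpod_spot": 0.34, "vast_low": 0.16},
    "A100 80GB": {"runpod_spot": 1.30, "lambda": 1.29, "vast_low": 0.67},
    "H100 SXM": {"runpod_spot": 2.69, "lambda": 4.29, "vast_low": 1.87},
}


def compute_cost(gpu: str, provider: str, hours: float, num_gpus: int = 1) -> float:
    """GPU rental cost in USD: hourly rate x hours x number of GPUs."""
    return HOURLY_RATES[gpu][provider] * hours * num_gpus


# Example: a 6-hour single-GPU run on each A100 80GB option.
for provider in HOURLY_RATES["A100 80GB"]:
    cost = compute_cost("A100 80GB", provider, hours=6)
    print(f"{provider}: ${cost:.2f}")
```

The same helper covers multi-GPU jobs by passing `num_gpus`, which is where the hourly differences between providers start to matter.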
Layer 2: Data
Data preparation is the most underestimated cost. Assembling 500–2,000 high-quality input/output training pairs — cleaning raw text, writing output examples, reviewing for correctness — typically takes 300–500 person-hours for a first dataset. At $20–50/hr for an engineer's time that is $6,000–$25,000 of hidden cost before a single GPU spins up. If you outsource annotation, professional labellers charge $0.50–$5.00 per example depending on complexity, plus a 15–20% QA overhead.
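The two data routes above can be compared with a small calculation. This is a minimal sketch that uses the ranges quoted in the paragraph as inputs; the function names are hypothetical, and the QA overhead is modelled as a simple surcharge on the annotation bill, which is an assumption about how a vendor would invoice it.

```python
def in_house_cost(hours: float, hourly_rate: float) -> float:
    """Cost of building the dataset with your own staff."""
    return hours * hourly_rate


def outsourced_cost(examples: int, price_per_example: float, qa_overhead: float) -> float:
    """Annotation cost plus a QA surcharge given as a fraction (0.15 = 15%)."""
    return examples * price_per_example * (1 + qa_overhead)


# Low and high ends of the ranges quoted above.
in_house = (in_house_cost(300, 20), in_house_cost(500, 50))
outsourced = (
    outsourced_cost(500, 0.50, 0.15),    # small, simple dataset
    outsourced_cost(2000, 5.00, 0.20),   # large, complex dataset
)

print(f"In-house:   ${in_house[0]:,.0f} to ${in_house[1]:,.0f}")
print(f"Outsourced: ${outsourced[0]:,.2f} to ${outsourced[1]:,.2f}")
```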
Layer 3: Engineering
Real fine-tuning projects involve experimentation: you run with different learning rates, epoch counts, and dataset mixes before finding a checkpoint that passes your evals. Industry estimates suggest 3–10 training runs per successful fine-tune, so multiply the compute cost of a single run by at least 3–5, and by up to 10 in the worst case, to cover the full project. Add evaluation tooling, integration work, and deployment, and the engineering overhead often reaches $4,000–$12,000 per project even at a small company.
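To make the iteration multiplier concrete, the sketch below scales a single-run cost range by the 3–10 runs mentioned above. The $5–$20 per-run figure is only an illustrative input, and `project_compute_range` is a hypothetical helper name.

```python
def project_compute_range(run_cost_low: float, run_cost_high: float,
                          runs_low: int = 3, runs_high: int = 10) -> tuple:
    """Best case: cheapest run x fewest runs. Worst case: priciest run x most runs."""
    return (run_cost_low * runs_low, run_cost_high * runs_high)


# Illustrative single-run cost of $5-$20, repeated 3-10 times.
low, high = project_compute_range(5, 20)
print(f"Project compute: ${low} to ${high}")
```

The spread between best and worst case grows quickly, which is why budgeting for one run understates the compute line.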
Layer 4: Serving
Training is a one-time cost; serving is a recurring one. A self-hosted 7B model on an L40S GPU costs roughly $1–4/hour to keep running, or $720–$2,880/month at 100% utilisation. A 70B model needs 2–4× A100s, pushing serving costs to $2,000–$6,000/month. Hosted fine-tuning APIs (OpenAI, Together AI) bill inference per token, which scales with usage rather than time — often cheaper below ~1M tokens/day but more expensive at high throughput.
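A quick way to sanity-check these serving figures is to convert the hourly rate into a monthly bill and then find the daily token volume at which per-token billing costs the same. This is a sketch under stated assumptions: a 30-day month, a GPU that is billed around the clock, and a per-token price of $10 per million that is purely hypothetical. It also ignores whether one GPU can actually sustain the break-even throughput.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, the convention behind the figures above


def monthly_serving_cost(hourly_rate: float, num_gpus: int = 1) -> float:
    """Monthly cost of keeping dedicated GPUs running 24/7."""
    return hourly_rate * num_gpus * HOURS_PER_MONTH


def break_even_tokens_per_day(hourly_rate: float, price_per_million_tokens: float,
                              num_gpus: int = 1) -> float:
    """Daily token volume where per-token billing equals the dedicated GPU bill."""
    daily_gpu_cost = hourly_rate * num_gpus * 24
    return daily_gpu_cost / price_per_million_tokens * 1_000_000


print(monthly_serving_cost(1.0), monthly_serving_cost(4.0))

# Hypothetical hosted price of $10 per million tokens vs a $1/hr GPU.
tokens = break_even_tokens_per_day(hourly_rate=1.0, price_per_million_tokens=10.0)
print(f"Break-even: {tokens:,.0f} tokens/day")
```

Below the break-even volume the per-token route is cheaper; above it the dedicated GPU wins, provided it can handle the load.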
Method-by-method cost comparison
The cheapest way to fine-tune a model is not the same as the best way for every use case. Here is how the three main routes compare on the compute layer alone, using a 7B parameter base model as the reference point.
Route A: LoRA / QLoRA on a rented GPU
QLoRA (quantised LoRA) lets you fine-tune a 7B–8B model on a single A100 80GB in 4–8 hours, training only a small set of low-rank adapter matrices rather than all weights. The GPU cost for a single run is typically $5–$20 — often under $10 on spot instances. A 13B model on two A100s runs $20–$40. A 70B model via QLoRA on a single H100 takes 12–24 hours, pushing costs to $30–$60 per run. These are the lowest-cost numbers in fine-tuning; the real project cost including data and iteration will be higher, but compute alone stays in the tens-of-dollars range.
Route B: Hosted fine-tuning APIs
OpenAI, Together AI, and similar services handle infrastructure for you and charge per training token. OpenAI's GPT-4o fine-tuning is priced at $25 per million training tokens; GPT-4o-mini is dramatically cheaper at roughly $3 per million training tokens. Together AI charges approximately $0.48 per million training tokens for models up to 16B parameters — among the cheapest hosted options for open-source models. A typical fine-tuning dataset of 100,000 tokens trained for 3 epochs consumes 300,000 training tokens; at Together AI rates that is under $0.15, while GPT-4o-mini would cost ~$0.90 and GPT-4o ~$7.50.
| Provider & Model | Training cost | 300K-token run (3 epochs on 100K tokens) | Notes |
|---|---|---|---|
| Together AI (≤16B models) | $0.48 / M tokens | ~$0.15 | Open-source models, Llama / Mistral / Qwen |
| OpenAI GPT-4o-mini | ~$3 / M tokens | ~$0.90 | Managed; inference billed separately |
| OpenAI GPT-4o | $25 / M tokens | ~$7.50 | Managed; inference billed separately |
| OpenAI GPT-4.1 | ~$25 / M tokens | ~$7.50 | Larger capability ceiling |
The catch with hosted APIs is inference pricing after training. OpenAI charges a premium for fine-tuned model inference — GPT-4o fine-tuned inference runs $3.75 per million input tokens and $15 per million output tokens, compared to $2.50/$10 for the base model. At high throughput those premiums accumulate quickly.
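The training figures in the table and the inference premium can both be reproduced with a few lines of arithmetic. The sketch below uses the prices quoted above as inputs; the monthly traffic of 50M input and 10M output tokens is a hypothetical workload chosen only for illustration, and the function names are not part of any provider SDK.

```python
def training_cost(dataset_tokens: int, epochs: int, price_per_million: float) -> float:
    """Hosted training cost: total training tokens x price per million tokens."""
    return dataset_tokens * epochs / 1_000_000 * price_per_million


def inference_cost(input_tokens: int, output_tokens: int,
                   input_price: float, output_price: float) -> float:
    """Inference cost with separate per-million prices for input and output."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Training: 100K-token dataset, 3 epochs, prices from the table above.
for name, price in [("Together AI", 0.48), ("GPT-4o-mini", 3.00), ("GPT-4o", 25.00)]:
    print(f"{name}: ${training_cost(100_000, 3, price):.2f}")

# Inference: hypothetical monthly traffic of 50M input / 10M output tokens.
base = inference_cost(50_000_000, 10_000_000, 2.50, 10.00)
tuned = inference_cost(50_000_000, 10_000_000, 3.75, 15.00)
print(f"Base ${base:.2f} vs fine-tuned ${tuned:.2f} ({tuned / base - 1:.0%} premium)")
```

Because both the input and output rates are 1.5× the base rates, the premium works out to 50% regardless of the input/output mix.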
Route C: Full fine-tuning on multi-GPU clusters
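Full fine-tuning updates every weight in the model. For a 7B model this requires at least 40–80 GB of GPU VRAM even with memory-efficient optimizers, and well over 100 GB with standard mixed-precision AdamW (about 16 bytes per parameter for weights, gradients, and optimizer states, before activations), so the job is normally spread across multiple GPUs. An 8× A100 cluster running a full fine-tune of a 7B model for 24–48 hours costs $250–$500 for the compute alone at roughly $1.30 per GPU-hour. A 70B model demands substantially more: 8× H100 SXM for 3–7 days puts compute at $4,000–$12,000 per run. That range implies roughly $7–9 per GPU-hour, well above the marketplace rates in the table above, so treat it as an on-demand estimate. Factor in the iteration multiplier and a successful full fine-tune of a 70B model routinely reaches $30,000–$60,000 in compute before data and engineering costs.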
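It can help to check cluster quotes by working in both directions: from a GPU-hour rate to a total, and from a quoted total back to the rate it implies. The sketch below does both with the figures from this section; the $1.30 rate is the A100 spot price from the earlier table, and the helper names are hypothetical.

```python
def cluster_cost(num_gpus: int, hours: float, rate_per_gpu_hour: float) -> float:
    """Total cost of a multi-GPU job."""
    return num_gpus * hours * rate_per_gpu_hour


def implied_rate(total_cost: float, num_gpus: int, hours: float) -> float:
    """GPU-hour rate implied by a quoted total."""
    return total_cost / (num_gpus * hours)


# 7B full fine-tune: 8x A100 for 24-48 hours at $1.30 per GPU-hour.
print(cluster_cost(8, 24, 1.30), cluster_cost(8, 48, 1.30))

# 70B full fine-tune: what rate do $4,000 (3 days) and $12,000 (7 days) imply?
print(round(implied_rate(4_000, 8, 3 * 24), 2))
print(round(implied_rate(12_000, 8, 7 * 24), 2))
```

If the implied rate is far from what your provider actually charges, rerun `cluster_cost` with your own rate rather than relying on the quoted range.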
Real budget scenarios
Abstract numbers are hard to internalise. Here are four representative scenarios that correspond to actual project types, with all four cost layers filled in.
| Scenario | Compute | Data | Engineering | Serving (monthly) | Total (first month) |
|---|---|---|---|---|---|
| Weekend LoRA experiment (7B, personal project) | $10–30 | $0 (existing data) | $0 (you) | $0 (local GPU) | $10–30 |
| Startup SFT (7B, hosted API, 2K examples) | $1–10 | $2,000–5,000 (contractor) | $3,000–6,000 | $500–1,500 (API tokens) | $5,500–12,500 |
| Mid-market LoRA (13B, rented cluster, 5K examples) | $200–600 (5 runs) | $5,000–15,000 | $8,000–15,000 | $1,500–4,000 (self-hosted) | $15,000–35,000 |
| Enterprise full fine-tune (70B, on-prem/cloud) | $30,000–60,000 | $30,000–80,000 | $20,000–50,000 | $5,000–15,000 | $85,000–205,000 |
The weekend experiment is the only scenario where compute dominates. In every other case, data preparation and engineering time are the majority of the budget. This is why "fine-tuning costs $10" and "fine-tuning costs $100,000" are both heard in the same industry conversation: the $10 quote is compute-only, the $100,000 quote is a full production project.
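The totals and the share taken by compute can be checked by summing the layer ranges directly. This sketch copies the ranges from the scenario table; pairing low ends with low ends and high ends with high ends is a simplification, since real projects mix them. The exact sums differ slightly from the rounded totals in the table.

```python
# (low, high) USD ranges per layer, copied from the scenario table above.
# Layer order: compute, data, engineering, serving (first month).
SCENARIOS = {
    "Weekend LoRA experiment": [(10, 30), (0, 0), (0, 0), (0, 0)],
    "Startup SFT": [(1, 10), (2_000, 5_000), (3_000, 6_000), (500, 1_500)],
    "Mid-market LoRA": [(200, 600), (5_000, 15_000), (8_000, 15_000), (1_500, 4_000)],
    "Enterprise full fine-tune": [(30_000, 60_000), (30_000, 80_000),
                                  (20_000, 50_000), (5_000, 15_000)],
}


def first_month_total(layers):
    """Sum the low ends and the high ends of every layer's range."""
    return (sum(low for low, _ in layers), sum(high for _, high in layers))


def compute_share(layers):
    """Compute's share of the total at the low and high ends of the range."""
    total_low, total_high = first_month_total(layers)
    compute_low, compute_high = layers[0]
    return (compute_low / total_low, compute_high / total_high)


for name, layers in SCENARIOS.items():
    low, high = first_month_total(layers)
    share_low, share_high = compute_share(layers)
    print(f"{name}: ${low:,} to ${high:,} (compute {share_low:.0%} to {share_high:.0%})")
```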
Going deeper
Once you have a handle on the basic cost landscape, a few advanced considerations become relevant as you scale.
Spot vs on-demand: when interruptibility is worth it
Spot (interruptible) instances on marketplaces like Vast.ai can be 50–80% cheaper than on-demand, but they can be reclaimed mid-run. For training jobs under 4 hours on a single GPU this is usually an acceptable tradeoff — checkpoint every 20 minutes and a restart costs you little. For 48-hour multi-node runs the interruption risk compounds, and a mid-run failure on a cluster job can corrupt the checkpoint state. Many teams use spot for early-stage experiments and switch to reserved capacity for their final production run.
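The tradeoff can be estimated with a simple expectation model. The sketch below assumes each interruption loses, on average, half a checkpoint interval of progress plus a fixed restart overhead, and that both are billed at the spot rate. The $2.00 on-demand rate, the 60% spot discount, the two interruptions, and the 10-minute restart are all hypothetical inputs.

```python
def on_demand_cost(work_hours: float, rate: float) -> float:
    """Cost of an uninterrupted run."""
    return work_hours * rate


def expected_spot_cost(work_hours: float, spot_rate: float, interruptions: int,
                       checkpoint_interval_hours: float,
                       restart_overhead_hours: float) -> float:
    """Expected spot cost including redone work and restart time.

    Assumption: an interruption lands, on average, halfway between two
    checkpoints, so half an interval of progress is lost each time.
    """
    lost_per_interruption = checkpoint_interval_hours / 2 + restart_overhead_hours
    billed_hours = work_hours + interruptions * lost_per_interruption
    return billed_hours * spot_rate


# 4-hour job, checkpoints every 20 minutes, 2 interruptions, 10-minute restarts.
full_price = on_demand_cost(4, rate=2.00)
spot_price = expected_spot_cost(4, spot_rate=0.80, interruptions=2,
                                checkpoint_interval_hours=20 / 60,
                                restart_overhead_hours=10 / 60)
print(f"On-demand ${full_price:.2f} vs spot ${spot_price:.2f}")
```

Raising `interruptions` or `checkpoint_interval_hours` shows how quickly the spot advantage erodes on long jobs with infrequent checkpoints.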
Cost per quality point: the real optimization target
Raw training cost is not the right optimisation target. What matters is the cost of reaching the quality level you need, measured against your baseline (prompting, RAG, or a different model). A $500 fine-tune that moves your task accuracy from 72% to 89% costs about $29 per point gained; a $50 fine-tune that moves it to 74% costs $25 per point but may leave you short of a usable product. Track your eval metric per dollar spent across runs, and compare runs by what it costs to clear your quality bar. The run that looks expensive in isolation is often the cheapest path to a given quality level.
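Working the two example runs through both lenses is instructive. Raw cost per point turns out to be similar for the two runs (about $29 versus $25), so the sketch also picks the cheapest run that clears a target accuracy. The 85% target is an assumed threshold, and the `Run` class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    cost: float      # USD spent on the run
    accuracy: float  # eval accuracy in percent


BASELINE_ACCURACY = 72.0


def cost_per_point(run: Run, baseline: float = BASELINE_ACCURACY) -> float:
    """Dollars spent per percentage point gained over the baseline."""
    gain = run.accuracy - baseline
    return run.cost / gain if gain > 0 else float("inf")


def cheapest_meeting_target(runs, target: float):
    """Cheapest run whose accuracy reaches the target, or None if none does."""
    qualifying = [r for r in runs if r.accuracy >= target]
    return min(qualifying, key=lambda r: r.cost) if qualifying else None


runs = [Run("big", 500, 89.0), Run("small", 50, 74.0)]
for run in runs:
    print(f"{run.name}: ${cost_per_point(run):.2f} per point")

best = cheapest_meeting_target(runs, target=85.0)  # assumed quality bar
print("Cheapest run meeting target:", best.name if best else "none")
```

Tracking both numbers per run makes it clear when a cheap run is efficient on paper but still not good enough to ship.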
When to choose RAG instead
If the goal is injecting factual knowledge — product catalogs, internal documentation, live data — RAG is almost always cheaper and more maintainable than fine-tuning. A retrieval pipeline over a vector database costs a fraction of a fine-tune and updates instantly when your data changes. Fine-tuning earns its cost when the goal is changing behaviour (style, format, reasoning patterns, task specialisation) rather than knowledge.
Merging LoRA adapters to eliminate serving overhead
A LoRA adapter is a small delta on top of the base model. Most LoRA tooling lets you merge the adapter weights back into the base model after training, producing a single model file with no runtime overhead from the adapter. This means you can fine-tune cheaply with LoRA, then serve the merged model at the same speed and cost as the base model. The merge operation takes minutes and needs no further training. Merge before deploying to a production endpoint unless you plan to serve several adapters from one shared base model, in which case keeping them separate saves memory.
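To see why merging adds no runtime overhead, it helps to look at the arithmetic on a toy example. In the standard LoRA formulation the adapted layer computes W·x + (alpha/r)·B·A·x, which equals (W + (alpha/r)·B·A)·x. The sketch below verifies that with plain Python lists; the tiny matrices are made up, and real tooling performs the same operation on full weight tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha: float, rank: int):
    """Return W + (alpha / rank) * B @ A as a single merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 base weight with a rank-1 adapter (values are illustrative only).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # shape (rank, d_in)
B = [[0.5], [-1.0]]     # shape (d_out, rank)
x = [[3.0], [4.0]]      # input as a column vector

merged = merge_lora(W, A, B, alpha=2.0, rank=1)

# Unmerged path: base output plus the scaled adapter output.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]

print(matmul(merged, x) == unmerged)
```

The merged matrix has the same shape as the original weight, so serving it costs exactly what serving the base model costs.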
FAQ
How much does it cost to fine-tune a 7B model like Llama?
With QLoRA on a single rented A100 80GB, a complete training run on a 7B model typically takes 4–8 hours and costs $5–$15 in compute on marketplaces like RunPod or Vast.ai. Using a hosted API like Together AI, training on 100,000 tokens for 3 epochs costs under $0.20. The limiting factor is rarely compute; it is data preparation and the number of experimental runs needed to reach a good result.
Is fine-tuning through OpenAI's API cheaper than renting a GPU?
For small datasets and one-off experiments, OpenAI's API is often cheaper because there is no minimum billing and no infrastructure to manage. For large datasets (>10M training tokens) or frequent re-training, renting GPUs and running open-source fine-tuning tooling usually wins on raw compute cost. The right answer depends on your volume, how much you value convenience, and whether you need the specific capabilities of a GPT model.
What is the cheapest way to fine-tune an LLM?
The cheapest route is QLoRA on a spot GPU instance — a 7B model on a single consumer GPU (RTX 4090 at ~$0.16/hr on Vast.ai) costs as little as $2–5 for a short training run on an existing clean dataset. The total budget for a minimal experiment, including data you already have, can genuinely be under $20. Costs rise sharply when you need professional annotation, multiple experiment runs, or production serving.
What does it cost to serve a fine-tuned open-source model in production?
Serving a 7B model on a dedicated A100 80GB endpoint costs roughly $1–1.30/hr, or $720–$950/month at full utilisation on providers like Lambda Labs or RunPod. A 70B model needs 2–4× A100s, pushing monthly serving costs to $2,000–$5,000+. If your request volume is low, serverless GPU inference (billed per token rather than per hour) is much cheaper — many providers offer this for popular open-source architectures.
How much of a fine-tuning budget should I allocate to data preparation?
Industry experience suggests data preparation consumes 20–40% of the total fine-tuning budget on average, and often more on the first project when processes aren't established. If you are building a dataset from scratch with professional annotation, budget at least $0.50–$5.00 per example plus QA overhead. A 1,000-example dataset with moderate complexity can easily cost $2,000–$5,000 to prepare properly.
Does fine-tuning cost more than just using a larger base model with better prompting?
Often, yes — at least up front. A more capable base model (e.g. upgrading from GPT-4o-mini to GPT-4o) can close a quality gap that would otherwise require fine-tuning, with zero training cost. Before committing to a fine-tuning project, run a benchmark comparing your target task on the cheaper model with fine-tuning versus the more expensive base model with good prompting. The upgrade path is frequently cheaper and faster to iterate on.