How Much Does Fine-Tuning Cost? A Realistic Budget Breakdown

Get realistic dollar figures for every fine-tuning route — rented GPUs, hosted APIs, and full runs — including the hidden costs of data and serving.

INTERMEDIATE | 12 MIN READ | UPDATED 2026-06-12

In plain English

"Fine-tuning is expensive" and "fine-tuning is cheap" are both correct — they just describe different points on a very wide spectrum. A weekend experiment using LoRA on a rented GPU can cost less than a cup of coffee. A full fine-tune of a 70B parameter model with human-annotated data and production serving easily crosses six figures. The number that matters is your number, and it depends on four variables: which method you use, how big the base model is, where you run the compute, and whether you factor in everything beyond the training job itself.


Think of it like renovating a house. The materials (compute) are just one line on the invoice. Labour (engineering time), design (dataset creation), and ongoing running costs (inference serving) often dwarf the raw materials cost — and they're the ones that surprise people. This article breaks every line item apart so you can estimate the real number before you commit.

Why the cost question matters

Fine-tuning is a capital decision. Unlike a prompt change — which you can roll back in seconds at zero cost — a fine-tuning run consumes time, money, and engineering attention up front and produces an artifact that will need maintenance. Getting the budget estimate wrong in either direction is costly: underestimate and you run out of runway mid-project; overestimate and you kill a project that would have been cheap to prove out.

The cost question also determines which fine-tuning approach makes sense. If your budget is under $100, full fine-tuning of anything larger than a 7B model is off the table — but LoRA or a hosted API call is well within reach. If your budget is $10,000+, you unlock larger models, more epochs, and the redundant runs needed to tune hyperparameters properly. Knowing the cost landscape first lets you choose the method that fits your budget rather than discovering the mismatch after the fact.

Choose the right method. LoRA/QLoRA, full fine-tuning, and hosted APIs have 10-100x cost differences for the same model size.
Avoid surprise bills. Serving a fine-tuned model on a dedicated GPU endpoint can cost more per month than the training run cost to produce it.
Plan for iteration. Real fine-tuning projects run 3-10 experiments before finding a winning checkpoint. Each run adds to the total.
Compare to alternatives. Sometimes RAG, better prompting, or a different base model is cheaper than fine-tuning — but you can only see that with real numbers in hand.

How the cost is built up

A complete fine-tuning budget has four layers. The compute layer gets most of the attention, but the other three are where projects actually go over budget.

The four cost layers of a fine-tuning project:

Compute: GPU hours for the training job (most visible; varies 10-100x by method)
Data: curation, cleaning, and annotation (20-40% of total budget in many projects)
Engineering: iteration, evals, and deployment (often the largest single line item)
Serving: inference at production scale (ongoing; can dwarf the one-time training cost)

Layer 1: Compute

GPU time is priced by the hour and varies enormously by hardware, cloud provider, and whether you're willing to use spot/interruptible instances. The three hardware tiers relevant to fine-tuning are the RTX 4090 (consumer, 24 GB VRAM — good for small models with QLoRA), the A100 80GB (data-centre, the workhorse for 7B-13B full fine-tunes and 70B LoRA), and the H100 80GB (fastest training, best for anything above 30B or tight deadlines).

GPU	VRAM	RunPod (spot)	Lambda Labs	Vast.ai (low)
RTX 4090	24 GB	~$0.34/hr	N/A	~$0.16/hr
A100 80GB	80 GB	~$1.30/hr	~$1.29/hr	~$0.67/hr
H100 SXM	80 GB	~$2.69/hr	~$4.29/hr	~$1.87/hr

Layer 2: Data

Data preparation is the most underestimated cost. Assembling 500–2,000 high-quality input/output training pairs — cleaning raw text, writing output examples, reviewing for correctness — typically takes 300–500 person-hours for a first dataset. At $20–50/hr for an engineer's time that is $6,000–$25,000 of hidden cost before a single GPU spins up. If you outsource annotation, professional labellers charge $0.50–$5.00 per example depending on complexity, plus a 15–20% QA overhead.
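The arithmetic behind those figures can be sketched in a few lines. This is a minimal illustration, not a costing tool: the function names are made up for this example, and the 2,000-example dataset size is taken from the top of the range quoted above.

```python
def in_house_cost(person_hours, hourly_rate):
    """Cost of building the dataset with your own staff."""
    return person_hours * hourly_rate


def outsourced_cost(num_examples, price_per_example, qa_overhead=0.15):
    """Cost of paid annotation plus a QA overhead fraction on top."""
    return num_examples * price_per_example * (1 + qa_overhead)


# Figures from the paragraph above: 300-500 hours at $20-50/hr.
in_house_low = in_house_cost(300, 20)
in_house_high = in_house_cost(500, 50)

# 2,000 examples at the quoted per-example prices, with 15% and 20% QA overhead.
outsourced_low = outsourced_cost(2000, 0.50, qa_overhead=0.15)
outsourced_high = outsourced_cost(2000, 5.00, qa_overhead=0.20)

print(f"In-house:   ${in_house_low:,.0f} to ${in_house_high:,.0f}")
print(f"Outsourced: ${outsourced_low:,.0f} to ${outsourced_high:,.0f}")
```

Swapping in your own hours, rates, and example counts gives a first estimate of the data layer before any compute is rented.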

Layer 3: Engineering

Real fine-tuning projects involve experimentation: you run with different learning rates, epoch counts, and dataset mixes before finding a checkpoint that passes your evals. Industry estimates suggest 3–10 training runs per successful fine-tune, so multiply the compute cost you budget for one run by 3–10 to cover the full project. Add evaluation tooling, integration work, and deployment, and the engineering overhead often reaches $4,000–$12,000 or more per project, even at a small company.

Layer 4: Serving

Training is a one-time cost; serving is a recurring one. A self-hosted 7B model on an L40S GPU costs roughly $1–4/hour to keep running, or $720–$2,880/month at 100% utilisation. A 70B model needs 2–4× A100s, pushing serving costs to $2,000–$6,000/month. Hosted fine-tuning APIs (OpenAI, Together AI) bill inference per token, which scales with usage rather than time — often cheaper below ~1M tokens/day but more expensive at high throughput.
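To see where per-hour and per-token billing cross over, a rough sketch follows. The 720-hour month matches the figures above; the $20 per million tokens price is a hypothetical blended rate chosen only for illustration, so substitute your provider's actual pricing.

```python
HOURS_PER_MONTH = 720  # 24 hours x 30 days, the convention used in the text


def dedicated_monthly_cost(hourly_rate, num_gpus=1, uptime_fraction=1.0):
    """Monthly cost of keeping GPUs running; you pay for uptime, not traffic."""
    return hourly_rate * num_gpus * HOURS_PER_MONTH * uptime_fraction


def per_token_monthly_cost(tokens_per_day, price_per_million_tokens, days=30):
    """Monthly cost of a per-token API at a given daily volume."""
    return tokens_per_day * days * price_per_million_tokens / 1_000_000


def break_even_tokens_per_day(monthly_dedicated, price_per_million_tokens, days=30):
    """Daily token volume at which per-token billing equals the dedicated bill."""
    return monthly_dedicated * 1_000_000 / (price_per_million_tokens * days)


low = dedicated_monthly_cost(1.0)    # $1/hr GPU
high = dedicated_monthly_cost(4.0)   # $4/hr GPU

HYPOTHETICAL_PRICE = 20.0  # USD per million tokens; an assumption, not a quote
crossover = break_even_tokens_per_day(low, HYPOTHETICAL_PRICE)

print(f"Dedicated GPU: ${low:,.0f} to ${high:,.0f} per month")
print(f"Break-even at about {crossover:,.0f} tokens/day")
```

Below the break-even volume the per-token route is cheaper; above it, the dedicated GPU wins, provided it can actually handle the load.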

Method-by-method cost comparison

The cheapest way to fine-tune a model is not the same as the best way for every use case. Here is how the three main routes compare on the compute layer alone, using a 7B parameter base model as the reference point.

Route A: LoRA / QLoRA on a rented GPU

QLoRA (quantised LoRA) lets you fine-tune a 7B–8B model on a single A100 80GB in 4–8 hours, training only a small set of low-rank adapter matrices rather than all weights. The GPU cost for a single run is typically $5–$20 — often under $10 on spot instances. A 13B model on two A100s runs $20–$40. A 70B model via QLoRA on a single H100 takes 12–24 hours, pushing costs to $30–$60 per run. These are the lowest-cost numbers in fine-tuning; the real project cost including data and iteration will be higher, but compute alone stays in the tens-of-dollars range.
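These per-run figures are just rate times GPUs times hours. A minimal sketch, using the RunPod spot rates from the GPU table earlier in the article (the function and dictionary names are illustrative):

```python
# Spot rates in USD per GPU-hour, copied from the RunPod column of the GPU table.
RUNPOD_SPOT = {"RTX 4090": 0.34, "A100 80GB": 1.30, "H100 SXM": 2.69}


def run_cost(gpu, hours, num_gpus=1, rates=RUNPOD_SPOT):
    """Compute cost of one training run: rate x GPUs x wall-clock hours."""
    return rates[gpu] * num_gpus * hours


# 7B QLoRA on one A100 for 4-8 hours.
qlora_7b = (run_cost("A100 80GB", 4), run_cost("A100 80GB", 8))

# 70B QLoRA on one H100 for 12-24 hours.
qlora_70b = (run_cost("H100 SXM", 12), run_cost("H100 SXM", 24))

print(f"7B QLoRA:  ${qlora_7b[0]:.2f} to ${qlora_7b[1]:.2f}")
print(f"70B QLoRA: ${qlora_70b[0]:.2f} to ${qlora_70b[1]:.2f}")
```

The results land close to the ranges quoted above; actual bills vary with the provider, the instance type, and how long the run really takes.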

Route B: Hosted fine-tuning APIs

OpenAI, Together AI, and similar services handle infrastructure for you and charge per training token. OpenAI's flagship GPT-5 fine-tuning is priced at $25 per million training tokens; the mini tier is dramatically cheaper at roughly $3 per million training tokens. Together AI charges approximately $0.48 per million training tokens for models up to 16B parameters — among the cheapest hosted options for open-source models. A typical fine-tuning dataset of 100,000 tokens trained for 3 epochs consumes 300,000 training tokens; at Together AI rates that is under $0.15, while the mini tier would cost ~$0.90 and the flagship tier ~$7.50.

Provider & Model	Training cost	300K-token run (3 epochs on 100K tokens)	Notes
Together AI (≤16B models)	$0.48 / M tokens	~$0.15	Open-source models, Llama / Mistral / Qwen
OpenAI GPT-5 mini tier	~$3 / M tokens	~$0.90	Managed; inference billed separately
OpenAI GPT-5 flagship tier	$25 / M tokens	~$7.50	Managed; inference billed separately
OpenAI GPT-5 Pro tier	~$25 / M tokens	~$7.50	Larger capability ceiling
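The table's third column follows from one formula: dataset tokens times epochs times the per-million price. A small sketch that reproduces it, with prices copied from the table (treat them as the article's figures, not live quotes):

```python
def training_cost(dataset_tokens, epochs, price_per_million_tokens):
    """Hosted fine-tuning bills every token once per epoch."""
    billed_tokens = dataset_tokens * epochs
    return billed_tokens * price_per_million_tokens / 1_000_000


# USD per million training tokens, as listed in the table above.
PRICES = {
    "Together AI (<=16B)": 0.48,
    "mini tier": 3.00,
    "flagship tier": 25.00,
}

# 100,000-token dataset trained for 3 epochs.
costs = {name: training_cost(100_000, 3, price) for name, price in PRICES.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```

Doubling the epochs or the dataset size doubles the bill, which is why trimming low-value examples pays off on the pricier tiers.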

The catch with hosted APIs is inference pricing after training. OpenAI charges a premium for fine-tuned model inference — a fine-tuned flagship model's inference runs at a markup over the base model's per-token rates. At high throughput those premiums accumulate quickly.

Route C: Full fine-tuning on multi-GPU clusters

Full fine-tuning updates every weight in the model. For a 7B model this requires at least 40–80 GB of GPU VRAM across multiple GPUs when you factor in optimizer states and activations. An 8× A100 cluster running a full fine-tune of a 7B model for 24–48 hours costs $250–$500 for the compute alone. A 70B model demands substantially more: 8× H100 SXM for 3–7 days puts compute at $4,000–$12,000 per run. Factor in the iteration multiplier and a successful full fine-tune of a 70B model routinely reaches $30,000–$60,000 in compute before data and engineering costs.

Real budget scenarios

Abstract numbers are hard to internalise. Here are four representative scenarios that correspond to actual project types, with all four cost layers filled in.

Scenario	Compute	Data	Engineering	Serving (monthly)	Total (first month)
Weekend LoRA experiment (7B, personal project)	$10–30	$0 (existing data)	$0 (you)	$0 (local GPU)	$10–30
Startup SFT (7B, hosted API, 2K examples)	$1–10	$2,000–5,000 (contractor)	$3,000–6,000	$500–1,500 (API tokens)	$5,500–12,500
Mid-market LoRA (13B, rented cluster, 5K examples)	$200–600 (5 runs)	$5,000–15,000	$8,000–15,000	$1,500–4,000 (self-hosted)	$15,000–35,000
Enterprise full fine-tune (70B, on-prem/cloud)	$30,000–60,000	$30,000–80,000	$20,000–50,000	$5,000–15,000	$85,000–205,000

The weekend experiment is the only scenario where compute dominates. In every other case, data preparation and engineering time are the majority of the budget. This is why "fine-tuning costs $10" and "fine-tuning costs $100,000" are both heard in the same industry conversation: the $10 quote is compute-only, the $100,000 quote is a full production project.
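A budget like the ones in the table can be kept as low/high ranges per layer and summed. The sketch below uses the weekend and enterprise rows; the `Range` type and helper names are illustrative, and the compute share is calculated on range midpoints, which is a simplifying assumption.

```python
from typing import NamedTuple


class Range(NamedTuple):
    low: float
    high: float


def total(layers):
    """Sum the low ends and the high ends of every cost layer."""
    return Range(
        sum(r.low for r in layers.values()),
        sum(r.high for r in layers.values()),
    )


def compute_share(layers):
    """Compute's fraction of the first-month total, using range midpoints."""
    def midpoint(r):
        return (r.low + r.high) / 2
    return midpoint(layers["compute"]) / midpoint(total(layers))


weekend = {
    "compute": Range(10, 30),
    "data": Range(0, 0),
    "engineering": Range(0, 0),
    "serving": Range(0, 0),
}
enterprise = {
    "compute": Range(30_000, 60_000),
    "data": Range(30_000, 80_000),
    "engineering": Range(20_000, 50_000),
    "serving": Range(5_000, 15_000),
}

for name, layers in [("weekend", weekend), ("enterprise", enterprise)]:
    t = total(layers)
    print(f"{name}: ${t.low:,.0f} to ${t.high:,.0f}, compute share {compute_share(layers):.0%}")
```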

Hidden cost traps to avoid

Several categories of cost reliably catch teams off guard on their first fine-tuning project.

Serving a fine-tuned model is not free

When you fine-tune an open-source model, you get back an artifact — a set of weights — that you need to serve yourself. Unlike a base model accessed through an API, a fine-tuned open-source model requires a live GPU endpoint. Hosting a 7B model on a dedicated A100 costs roughly $960–$2,160 per month at standard cloud rates. If your use case doesn't need a dedicated endpoint (low request volume, batch jobs), consider serverless inference hosting instead, which bills per token rather than by the hour.

Catastrophic forgetting and retraining cycles

If your training data is too narrow, the model can lose general capabilities — a phenomenon called catastrophic forgetting. Discovering this late means re-curating data and rerunning experiments. Pilots that skip proper evaluation before calling a model production-ready often discover this problem after paying for serving infrastructure. Build evaluation into each training run, not just the final one.

Storage and checkpointing

A 7B model checkpoint in fp16 is about 14 GB. A 70B model is ~140 GB. Keeping multiple checkpoints from multiple runs across multiple experiments adds up fast. Cloud object storage (S3, GCS) runs roughly $0.023 per GB per month: storing 10 checkpoints from each of 5 experiments on a 70B model is 50 checkpoints, about 7 TB, or ~$160/month just for storage.
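The storage figures follow from two bytes per parameter in fp16. A short sketch under that assumption (weights only; checkpoints that also save optimizer state will be larger, and the function names are illustrative):

```python
def checkpoint_gb(params_billion, bytes_per_param=2):
    """Weights-only checkpoint size in GB; fp16 uses 2 bytes per parameter."""
    # 1e9 parameters x bytes per parameter / 1e9 bytes per GB
    return params_billion * bytes_per_param


def monthly_storage_cost(params_billion, checkpoints_per_experiment,
                         experiments, price_per_gb_month=0.023):
    """Total stored GB and the monthly bill at a flat per-GB price."""
    total_gb = (checkpoint_gb(params_billion)
                * checkpoints_per_experiment * experiments)
    return total_gb, total_gb * price_per_gb_month


size_7b = checkpoint_gb(7)
size_70b = checkpoint_gb(70)
total_gb, monthly = monthly_storage_cost(70, checkpoints_per_experiment=10, experiments=5)

print(f"7B: {size_7b} GB, 70B: {size_70b} GB")
print(f"50 checkpoints of 70B: {total_gb:,} GB, about ${monthly:,.0f}/month")
```

Deleting all but the best few checkpoints per experiment is usually the simplest way to keep this line item small.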

The cost of a wasted run

A training run that fails due to a bug in the data pipeline, a misconfigured learning rate, or out-of-memory crashes wastes real money. On an 8× H100 cluster at $2.69/hr per GPU, a single wasted 12-hour run costs ~$260. Invest in a short sanity run (1–2% of data, 1 epoch) before launching a full job. The sanity run typically costs under $5 and catches the most common failure modes before they become expensive ones.

Going deeper

Once you have a handle on the basic cost landscape, a few advanced considerations become relevant as you scale.

Spot vs on-demand: when interruptibility is worth it

Spot (interruptible) instances on marketplaces like Vast.ai can be 50–80% cheaper than on-demand, but they can be reclaimed mid-run. For training jobs under 4 hours on a single GPU this is usually an acceptable tradeoff — checkpoint every 20 minutes and a restart costs you little. For 48-hour multi-node runs the interruption risk compounds, and a mid-run failure on a cluster job can corrupt the checkpoint state. Many teams use spot for early-stage experiments and switch to reserved capacity for their final production run.
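One way to reason about this tradeoff is to price in the work lost at each interruption. The model below is a deliberate simplification: it assumes each interruption loses, on average, half a checkpoint interval plus a fixed restart overhead. The rates, interruption count, and restart time are hypothetical inputs, not measurements.

```python
def on_demand_cost(rate, job_hours):
    """Uninterrupted run at the on-demand hourly rate."""
    return rate * job_hours


def expected_spot_cost(rate, job_hours, interruptions,
                       checkpoint_minutes, restart_minutes):
    """Spot cost including re-done work and restart time per interruption."""
    # Assumption: an interruption lands, on average, halfway between checkpoints.
    lost_hours_each = (checkpoint_minutes / 2 + restart_minutes) / 60
    return rate * (job_hours + interruptions * lost_hours_each)


ON_DEMAND_RATE = 2.00             # hypothetical USD per hour
SPOT_RATE = ON_DEMAND_RATE * 0.4  # 60% cheaper, inside the 50-80% range

baseline = on_demand_cost(ON_DEMAND_RATE, job_hours=4)
spot = expected_spot_cost(SPOT_RATE, job_hours=4, interruptions=2,
                          checkpoint_minutes=20, restart_minutes=10)

print(f"On-demand: ${baseline:.2f}")
print(f"Spot with 2 interruptions: ${spot:.2f}")
```

With frequent checkpoints the penalty per interruption stays small; lengthen the checkpoint interval or the restart time and the spot advantage shrinks accordingly.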

Cost per quality point: the real optimization target

Raw training cost is not the right optimisation target. What matters is what you pay for quality improvement over your baseline (prompting, RAG, or a different model), and whether the result clears the bar your product needs. A $500 fine-tune that moves your task accuracy from 72% to 89% costs about $29 per point gained; a $50 fine-tune that moves it to 74% costs $25 per point but leaves you 15 points behind. If the product needs 85%, only the first run is worth anything. Track your eval metric against dollars spent across runs. The run that looks expensive in isolation is often the cheapest path to a given quality level.
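A small sketch for tracking this across runs: it reports dollars per accuracy point gained and whether each run clears a target. The two runs use the figures from the paragraph above; the 85% target is a hypothetical threshold added for illustration.

```python
def cost_per_point(cost, baseline_acc, tuned_acc):
    """Dollars spent per accuracy point gained over the baseline."""
    gain = tuned_acc - baseline_acc
    if gain <= 0:
        return float("inf")  # no improvement, so no finite price per point
    return cost / gain


BASELINE_ACC = 72
TARGET_ACC = 85  # hypothetical quality bar

runs = [
    {"name": "large run", "cost": 500, "acc": 89},
    {"name": "small run", "cost": 50, "acc": 74},
]

for run in runs:
    run["per_point"] = cost_per_point(run["cost"], BASELINE_ACC, run["acc"])
    run["meets_target"] = run["acc"] >= TARGET_ACC
    print(f"{run['name']}: ${run['per_point']:.2f}/point, "
          f"meets target: {run['meets_target']}")

# Cheapest run among those that actually reach the target.
viable = [r for r in runs if r["meets_target"]]
best = min(viable, key=lambda r: r["cost"]) if viable else None
```

Reporting both numbers side by side avoids picking a run that is cheap per point yet never reaches the quality you need.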

When to choose RAG instead

If the goal is injecting factual knowledge — product catalogs, internal documentation, live data — RAG is almost always cheaper and more maintainable than fine-tuning. A retrieval pipeline over a vector database costs a fraction of a fine-tune and updates instantly when your data changes. Fine-tuning earns its cost when the goal is changing behaviour (style, format, reasoning patterns, task specialisation) rather than knowledge.

Merging LoRA adapters to eliminate serving overhead

A LoRA adapter is a small delta on top of the base model. Most LoRA tooling lets you merge the adapter weights back into the base model after training, producing a single model file with no runtime overhead from the adapter. This means you can fine-tune cheaply with LoRA, then serve the merged model at the same speed and cost as the base model. The merge operation takes minutes and costs next to nothing. Merge before deploying to a production endpoint unless you plan to serve several adapters on one shared base model.
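What merging does mathematically is fold the low-rank update into the base weight: W' = W + (alpha / r) * B A. The toy sketch below shows this with plain Python lists on a 3x3 matrix and a rank-1 adapter; all values are made up. In practice a library performs this step for you (for example, Hugging Face PEFT exposes `merge_and_unload`).

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]


def scale(a, s):
    """Multiply every element of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return the merged weight W + (alpha / r) * B @ A."""
    return add(W, scale(matmul(B, A), alpha / r))


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # base weight, 3x3
B = [[1.0], [2.0], [3.0]]                                # adapter, d_out x r
A = [[0.5, 0.0, -0.5]]                                   # adapter, r x d_in
alpha, r = 2.0, 1

x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Unmerged inference: base path plus a separate, scaled adapter path.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Merged inference: a single matrix multiply with the folded weight.
merged = matmul(merge_lora(W, A, B, alpha, r), x)

print(unmerged, merged)
```

Both paths give the same output, which is why the merged model behaves like the adapted one while costing only one matrix multiply per layer at serving time.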

FAQ

How much does it cost to fine-tune a 7B model like Llama?

With QLoRA on a single rented A100 80GB, one training run on a 7B model typically takes 4–8 hours and costs $5–$15 in compute on marketplaces like RunPod or Vast.ai. Using a hosted API like Together AI, training on 100,000 tokens for 3 epochs costs under $0.20. The limiting factor is rarely compute; it is data preparation and the number of experimental runs needed to reach a good result.

Is fine-tuning through OpenAI's API cheaper than renting a GPU?

For small datasets and one-off experiments, OpenAI's API is often cheaper because there is no minimum billing and no infrastructure to manage. For large datasets (>10M training tokens) or frequent re-training, renting GPUs and running open-source fine-tuning tooling usually wins on raw compute cost. The right answer depends on your volume, how much you value convenience, and whether you need the specific capabilities of a GPT model.

What is the cheapest way to fine-tune an LLM?

The cheapest route is QLoRA on a spot GPU instance — a 7B model on a single consumer GPU (RTX 4090 at ~$0.16/hr on Vast.ai) costs as little as $2–5 for a short training run on an existing clean dataset. The total budget for a minimal experiment, including data you already have, can genuinely be under $20. Costs rise sharply when you need professional annotation, multiple experiment runs, or production serving.

What does it cost to serve a fine-tuned open-source model in production?

Serving a 7B model on a dedicated A100 80GB endpoint costs roughly $1–1.30/hr, or $720–$950/month at full utilisation on providers like Lambda Labs or RunPod. A 70B model needs 2–4× A100s, pushing monthly serving costs to $2,000–$5,000+. If your request volume is low, serverless GPU inference (billed per token rather than per hour) is much cheaper — many providers offer this for popular open-source architectures.

How much of a fine-tuning budget should I allocate to data preparation?

Industry experience suggests data preparation consumes 20–40% of the total fine-tuning budget on average, and often more on the first project when processes aren't established. If you are building a dataset from scratch with professional annotation, budget at least $0.50–$5.00 per example plus QA overhead. A 1,000-example dataset with moderate complexity can easily cost $2,000–$5,000 to prepare properly.

Does fine-tuning cost more than just using a larger base model with better prompting?

Often, yes — at least up front. A more capable base model (e.g. upgrading from a mini tier to a flagship model like GPT-5.5) can close a quality gap that would otherwise require fine-tuning, with zero training cost. Before committing to a fine-tuning project, run a benchmark comparing your target task on the cheaper model with fine-tuning versus the more expensive base model with good prompting. The upgrade path is frequently cheaper and faster to iterate on.
