What Is Hosted Fine-Tuning? Tuning via API Instead of GPUs


Learn the upload-a-file path to a custom model: what hosted fine-tuning platforms handle for you, what they charge, and the lock-in you accept.

INTERMEDIATE · 14 MIN READ · UPDATED 2026-06-12

In plain English

Hosted fine-tuning means you customize a language model by uploading a data file to a cloud API — no GPU rental, no CUDA setup, no training loop to write. You hand the provider a JSONL file of examples, set a few hyperparameters, and the platform runs the training job on its own hardware. When training finishes, a custom model endpoint is waiting for you to call.


Think of it like a commercial print shop. You bring your design file; they own the printing presses. You don't need to understand how offset lithography works to get a thousand flyers made. Hosted fine-tuning is the same deal for model training: you supply the knowledge (your examples), the platform supplies the compute (their GPUs), and you walk away with a finished product.

The alternative is self-hosted fine-tuning: you rent or own GPU instances, install training libraries like Hugging Face Transformers or Axolotl, write a training script, manage data loading, handle checkpointing, and debug CUDA out-of-memory errors yourself. That path is powerful and flexible, but it has a steep ramp. Hosted fine-tuning trades some of that flexibility for a dramatically simpler developer experience.

Why it matters

Fine-tuning a model from scratch on your own infrastructure used to require a machine-learning engineer, a GPU cluster, and weeks of iteration. Hosted fine-tuning compresses that to an afternoon. That shift changes who can fine-tune: a solo developer with a credit card and a few hundred good examples can now build a custom model that outperforms a base model on their specific task.

The practical impact breaks into three areas:

Speed. A training job on OpenAI or Together AI for a modest dataset (a few thousand examples) often completes in under an hour. Self-hosted jobs at the same scale require provisioning, installing dependencies, debugging environments, and monitoring — adding hours or days before the first token of your custom model.
No infrastructure tax. You pay only for training tokens and inference calls. There are no idle GPU costs while your machine waits between jobs, no cluster to resize, and no on-call obligations if a GPU node fails.
Accessibility. Teams without ML infrastructure expertise — product engineers, domain specialists, startups — can now fine-tune production-quality models. The barrier is knowing your task and having good data, not knowing how to configure distributed training.

The trade-off is real, though. You get convenience by giving up control. You can only fine-tune models the platform exposes, you cannot inspect or modify the training loop, and if the provider changes pricing or deprecates your model, your deployment depends on their decision. That lock-in has a cost that escalates the more deeply you integrate.

How it works

Every hosted fine-tuning platform follows the same five-step pipeline (prepare data, upload the file, create a job, poll for status, call the model), even if the API calls differ. Understanding the pattern once lets you switch platforms without re-learning the workflow.

// Hosted fine-tuning pipeline

1. Prepare data: JSONL with a messages array
2. Upload file: POST /files, purpose=fine-tune
3. Create job: POST /fine_tuning/jobs
4. Poll status: GET the job until it reports succeeded
5. Call model: use the ft: model ID in your app

The JSONL training file

Every major platform expects training examples as JSONL (JSON Lines): one JSON object per line, no trailing commas, no wrapping array. Each line is one training example. For chat-style models, each example is a messages array with role and content fields, exactly like the messages you send during inference:

training.jsonl (JSON, one line per example)

{"messages": [{"role": "system", "content": "You are a triage assistant for a software help desk."}, {"role": "user", "content": "My login page shows a 502 error after the latest deploy."}, {"role": "assistant", "content": "Priority: HIGH. Category: Infrastructure. Assign to: Platform team. The 502 suggests the app server is not reachable — check load-balancer health targets first."}]}
{"messages": [{"role": "system", "content": "You are a triage assistant for a software help desk."}, {"role": "user", "content": "The font on the dashboard looks slightly different from last week."}, {"role": "assistant", "content": "Priority: LOW. Category: UI. Assign to: Frontend team. Likely a CSS change in the last release — no user impact, can be addressed in next sprint."}]}
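
Before uploading, it can help to check the file locally for the format rules described above. The following is a minimal sketch, not any platform's official validator: the function name `validate_jsonl_lines` and the allowed-role set are assumptions for illustration, and real platforms may accept additional roles or fields.

```python
import json

# Assumption: a simplified role set; real platforms may allow more (e.g. tool roles).
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = example.get("messages") if isinstance(example, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' array"))
            continue
        malformed = any(
            not isinstance(m, dict)
            or m.get("role") not in ALLOWED_ROLES
            or not isinstance(m.get("content"), str)
            for m in messages
        )
        if malformed:
            problems.append((number, "each message needs a known 'role' and a string 'content'"))
            continue
        if messages[-1]["role"] != "assistant":
            problems.append((number, "last message should be the assistant reply"))
    return problems
```

A typical call would open the file in text mode and pass the file object directly, for example `validate_jsonl_lines(open("training.jsonl", encoding="utf-8"))`, since iterating a file yields one line at a time.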

Upload, kick off, and poll

You upload the file first, receive a file ID, then create a fine-tuning job that references that ID. The platform validates your data (checking format, token counts, minimum example counts), queues the job, runs it, and returns a model ID when it completes. The entire exchange is a few short API calls:

Kick off a fine-tuning job with the OpenAI SDK (Python)

import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Same system prompt as in training.jsonl; reuse it verbatim at inference time
SYSTEM_PROMPT = "You are a triage assistant for a software help desk."

# 1. Upload the training file
with open("training.jsonl", "rb") as f:
    file_obj = client.files.create(file=f, purpose="fine-tune")

print("File ID:", file_obj.id)  # e.g. file-abc123

# 2. Create the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_obj.id,
    model="gpt-5.5",
    hyperparameters={"n_epochs": 3},
)

print("Job ID:", job.id)  # e.g. ftjob-xyz789

# 3. Poll until the job reaches a terminal state (simplistic; use webhooks in production)
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print(job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(30)

# 4. Use the custom model
if job.status == "succeeded":
    ft_model = job.fine_tuned_model
    print("Custom model:", ft_model)  # e.g. ft:gpt-5.5:org::abc
    resp = client.chat.completions.create(
        model=ft_model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "My login page shows a 502."},
        ],
    )
    print(resp.choices[0].message.content)
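
The fixed 30-second loop above never gives up if a job stalls. One possible refinement, sketched here under assumptions, adds a deadline and exponential backoff. The helper name `wait_for_job` and its parameters are hypothetical; `retrieve` is any callable that returns an object with a `status` attribute, such as `client.fine_tuning.jobs.retrieve`.

```python
import time

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, timeout_s=6 * 3600,
                 first_delay_s=10.0, max_delay_s=120.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll retrieve(job_id) until a terminal status, backing off between calls."""
    deadline = clock() + timeout_s
    delay = first_delay_s
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still '{job.status}' after {timeout_s}s")
        sleep(delay)
        # Double the wait each round, capped so long jobs are still checked regularly.
        delay = min(delay * 2, max_delay_s)
```

With the client from the example above, the call would look like `wait_for_job(client.fine_tuning.jobs.retrieve, job.id)`. Passing `sleep` and `clock` as parameters is only there so the helper can be exercised without real waiting.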

What the platform does in the background

From the moment you submit the job, the provider handles everything you would have to do yourself on bare metal: validating and tokenizing your training examples, splitting a validation set if you did not supply one, loading the base model weights onto GPU memory, running the forward and backward passes for each epoch, checkpointing periodically, running eval metrics, and finally registering the resulting weights under your account as a callable model endpoint. You never see any of this — it either succeeds or surfaces a clear error.

Platform landscape: who offers what

The hosted fine-tuning market has consolidated around a handful of platforms with meaningfully different trade-offs. Which one makes sense depends on whether you need a proprietary frontier model, a specific open model, pricing structure, or data-privacy guarantees.

Platform	Tunable models	Training cost (approx)	Notes
OpenAI	GPT-5 family (flagship and mini/nano tiers)	Per-token; scales with model tier	Tight API; proprietary models only; inference at standard rates
Google Vertex AI	Gemini Flash and Pro tiers	Per-token; scales with model tier	Inference at same price as base model — no markup
Together AI	Llama, Mistral, Qwen, Gemma, and others	Per-token; scales with model size	Open models only; dedicated endpoint needed for serving
Fireworks AI	Llama, Mistral, FireFunction, others	Per-token; scales with model size	LoRA focus; up to 100 LoRA adapters on one shared deployment
AWS Bedrock	Amazon Nova, Titan, Cohere Command	~$8/1M tokens (Nova Pro)	Requires provisioned throughput for inference; VPC-friendly

OpenAI: the default starting point

OpenAI's fine-tuning API is the most documented and most copied. You tune a model from the GPT-5 family and pay the same inference rates as the base model after training, with cheaper per-token training on the smaller mini and nano tiers. The dashboard shows training and validation loss curves, lets you compare completions side-by-side before and after tuning, and supports vision fine-tuning on image inputs. The hard constraint is that you can only tune OpenAI's own models — there is no path to a Llama or Mistral fine-tune through this platform.

Google Vertex AI: no inference markup

Vertex AI supervised fine-tuning of a Gemini Flash-tier model is priced per training token, and the resulting tuned model is served at the same per-token price as the base Flash model. That means there is no hidden "you pay extra for the custom model" tax on every inference call — a meaningful advantage at high request volumes. You interact via the google-cloud-aiplatform SDK or the Vertex AI console.

Together AI and Fireworks: open-model fine-tuning

If you want to fine-tune an open-weight model — Llama, Mistral, or Qwen — without setting up your own training infrastructure, Together AI and Fireworks AI both offer hosted fine-tuning via API. Training is priced per token, scaling with model size. The key difference from OpenAI and Vertex is the serving model: after training on Together, your fine-tuned weights run on a dedicated endpoint priced at GPU-hour rates (roughly $6–12/hr on H100/B200). Fireworks lets you share a LoRA adapter across a base model deployment, which can be cheaper if you have moderate traffic.

Lock-in and trade-offs you are accepting

Hosted fine-tuning is not free. The convenience you gain comes with real trade-offs that are easy to underestimate when you are in a hurry to ship.

// Hosted vs. self-hosted fine-tuning

Hosted (OpenAI, Vertex, Together)

No GPU setup — start in minutes
Pay only training tokens + inference
Limited to platform's supported models
No access to training internals or checkpoints
Provider can deprecate your model version
Data leaves your environment for training

Self-hosted (Axolotl, Transformers, unsloth)

Full model zoo — any open model
Full control: LoRA rank, optimizer, scheduler
Keep checkpoints, export to any format
GPU provisioning and ops burden
CUDA, memory, and distributed-training debugging
Data stays in your infrastructure

Model deprecation risk

When OpenAI retires a base model, fine-tuned versions built on that base are retired too — often on the same schedule. If your production system calls a fine-tuned endpoint built on an older base model and that checkpoint gets deprecated, you must retrain against a newer base. Providers typically give 3–6 months of notice, but every cycle costs training time, dollars, and the regression-testing overhead of validating the new model behaves the same as the old one.

Data privacy

Your training data travels to the provider's infrastructure. OpenAI's enterprise tier and Azure OpenAI offer contractual guarantees that training data is not used to improve their base models. Standard-tier accounts do not have those guarantees by default — check the data usage policy before uploading anything that includes PII, proprietary data, or content under NDA. Vertex AI and Bedrock run within your existing Google Cloud or AWS account, which can give your security team a cleaner compliance story.

No access to weights

When you fine-tune on OpenAI, Google, or AWS, you never receive the actual model weights. You get a model ID you can call — but you cannot download the checkpoint, run it locally, switch inference providers, or inspect what the training changed. This is fundamentally different from self-hosted fine-tuning, where the checkpoint is a file you own. For most teams this is irrelevant; for teams with hard data-residency or portability requirements, it is a deal-breaker.

Data requirements and practical tips

The most common reason a hosted fine-tune fails to improve performance is data quality, not the platform or the model. Training the job itself is easy — preparing data that actually teaches the model something new is the hard part.

Minimum example counts

Most platforms enforce a lower bound: OpenAI requires at least 10 examples to start a job (in practice, 50–100 gives you something worth evaluating; 500–1000 is a typical production starting point). The right number is not a fixed answer — it depends on how different your task is from the base model's existing behavior. Teaching a model to respond in a new language or format takes fewer examples than teaching it brand-new domain knowledge.

Format your training data like your production prompts

The system prompt in your training examples should match exactly what you will send at inference time. If your production calls include a system prompt, include the same one in every training example. If you fine-tune on examples without a system prompt but then add one in production, the model saw a different context distribution during training — results will be inconsistent.
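
A quick way to catch a mismatch is to count the distinct system prompts in the training file and compare them with the one used in production. This is a small sketch under assumptions: the helper name `system_prompt_counts` is hypothetical, and it assumes every non-blank line already parses as a chat example with a `messages` array.

```python
import json
from collections import Counter


def system_prompt_counts(lines):
    """Count distinct system prompts across examples (key None = no system message)."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        messages = json.loads(line)["messages"]
        # Take the first system message, if the example has one.
        prompt = next((m["content"] for m in messages if m["role"] == "system"), None)
        counts[prompt] += 1
    return counts
```

If the result has more than one key, or its single key differs from the production system prompt, the training data and the inference calls are not aligned.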

Hold out a validation split

All major platforms accept an optional validation_file alongside your training file. Supply it. Watching validation loss alongside training loss is the only reliable way to catch overfitting early — when training loss falls but validation loss rises, the model is memorizing your examples rather than generalizing from them. Without a validation file, you find out about overfitting much later, after you have already paid for inference calls on a model that degraded.
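
Producing the held-out file can be as simple as a seeded shuffle and a slice. The sketch below is one way to do it; the function name `split_examples` and the 10% default are assumptions, not a platform recommendation.

```python
import random


def split_examples(lines, validation_fraction=0.1, seed=0):
    """Shuffle examples deterministically and return (train_lines, validation_lines)."""
    examples = [line for line in lines if line.strip()]
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(examples)
    n_val = max(1, round(len(examples) * validation_fraction))
    return examples[n_val:], examples[:n_val]
```

Each returned list can be written to its own JSONL file and uploaded separately. With the OpenAI SDK, the two file IDs would then go to the `training_file` and `validation_file` arguments of the job creation call.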

Token limits and cost estimation

Training cost is (total tokens in dataset) × (number of epochs) × (price per million tokens). A 500-example dataset where each example averages 400 tokens is 200,000 tokens. At 3 epochs that is 600,000 training tokens. At a small-tier rate of, say, $3/M tokens the job costs $1.80; at a flagship-tier rate of $25/M tokens it costs $15. The cheaper mini/Flash tiers cost far less per token to train than the flagship models, so count your tokens and check the current rate before choosing your base model.
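
The arithmetic above can be written as a small function. The rates are the illustrative $3/M and $25/M figures from the paragraph, not current list prices, and the function name `training_cost_usd` is an assumption.

```python
def training_cost_usd(total_dataset_tokens, n_epochs, price_per_million_usd):
    """Estimate training cost: dataset tokens x epochs x price per million tokens."""
    return total_dataset_tokens * n_epochs * price_per_million_usd / 1_000_000


dataset_tokens = 500 * 400  # 500 examples averaging 400 tokens each

small_tier = training_cost_usd(dataset_tokens, n_epochs=3, price_per_million_usd=3.00)
flagship = training_cost_usd(dataset_tokens, n_epochs=3, price_per_million_usd=25.00)

print(f"small tier: ${small_tier:.2f}")  # prints "small tier: $1.80"
print(f"flagship:   ${flagship:.2f}")    # prints "flagship:   $15.00"
```

Real token counts depend on the tokenizer of the base model, so the per-example average should be measured rather than guessed.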

Going deeper

Once you have run a few hosted fine-tunes and understand the workflow, here are the considerations that separate teams that ship reliable fine-tuned models from teams that chase diminishing returns.

Hyperparameter tuning: what you can actually control

Most hosted platforms expose a narrow set of hyperparameters: number of epochs (n_epochs), learning rate multiplier, and sometimes batch size. That is intentionally limited. Unlike self-hosted training, where you can set the optimizer, warmup schedule, gradient accumulation, and weight decay independently, hosted fine-tuning trades control for convenience. For the majority of use cases the defaults are well chosen, and the epoch count and learning rate multiplier are the only levers worth pulling. Start with the defaults, measure whether the model improved, then adjust epochs: add epochs if you see underfitting (loss still falling at the end) and remove them if you see overfitting (validation loss rising).
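
The underfitting and overfitting signals described above can be turned into a rough check on the loss values a platform reports. This is a heuristic sketch only: the function name `diagnose`, the window size, and the tolerance are all assumptions, and noisy curves may need smoothing first.

```python
def diagnose(train_losses, val_losses, window=3, tolerance=0.01):
    """Compare the mean of the last `window` losses with the window before it."""
    if min(len(train_losses), len(val_losses)) < 2 * window:
        raise ValueError("need at least 2 * window loss values per curve")

    def trend(values):
        recent = sum(values[-window:]) / window
        earlier = sum(values[-2 * window:-window]) / window
        return recent - earlier  # negative means the loss is still falling

    train_trend = trend(train_losses)
    val_trend = trend(val_losses)
    if train_trend < -tolerance and val_trend > tolerance:
        return "overfitting"   # training loss falls while validation loss rises
    if val_trend < -tolerance:
        return "underfitting"  # validation loss still falling at the end
    return "converged"
```

A result of "overfitting" suggests trying fewer epochs, and "underfitting" suggests trying more, in line with the guidance above.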

When to migrate to self-hosted

Hosted fine-tuning becomes the wrong tool when: (1) you need a model architecture not exposed by any platform — for example, a specific 0.5B parameter model optimized for edge deployment; (2) your data contains information that cannot leave your network boundary; (3) you are running so many inference requests that even the "no markup" Vertex pricing is more expensive than owning a dedicated GPU; or (4) you need full checkpoints to run side-by-side A/B experiments or to merge adapters. In each of those cases, the hosted abstraction is working against you.
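
For condition (3), a back-of-the-envelope break-even helps decide when dedicated hardware starts to pay off. The sketch below uses purely hypothetical prices ($8 per GPU-hour and $2 per million tokens) and ignores operations cost and whether a single GPU can actually sustain the resulting throughput.

```python
def breakeven_tokens_per_hour(gpu_hourly_usd, per_million_tokens_usd):
    """Tokens per hour at which per-token billing equals the hourly GPU price."""
    return gpu_hourly_usd / per_million_tokens_usd * 1_000_000


# Hypothetical prices for illustration only; substitute current quotes.
threshold = breakeven_tokens_per_hour(gpu_hourly_usd=8.0, per_million_tokens_usd=2.0)
print(f"{threshold:,.0f} tokens/hour")  # prints "4,000,000 tokens/hour"
```

Sustained traffic below the threshold favors per-token billing; sustained traffic above it is where a dedicated GPU may become cheaper.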

Reinforcement fine-tuning and beyond SFT

Hosted platforms have steadily expanded beyond basic SFT. OpenAI added reinforcement fine-tuning in 2025, which lets you train a model with a reward signal (a grader function that scores outputs) rather than only fixed-label examples — better for tasks where multiple outputs are valid or where you want the model to optimize a measurable outcome like code correctness. AWS Bedrock followed with reinforcement fine-tuning support in late 2025. These methods deliver larger improvements on complex reasoning tasks but require you to write a grader, making them harder to set up than simple JSONL fine-tuning.
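
To make the idea of a grader concrete, here is a plain-Python sketch that scores a help-desk triage reply (the format used in the training examples earlier) by the fraction of structured fields it gets right. It is not any provider's grader configuration format; the names `parse_fields` and `grade` are hypothetical.

```python
import re

# Matches "Priority: HIGH." style fields from the triage reply format.
FIELD_PATTERN = re.compile(r"(Priority|Category|Assign to):\s*([^.]+)\.")


def parse_fields(text):
    """Extract the structured fields from a triage reply, lowercased for comparison."""
    return {name: value.strip().lower() for name, value in FIELD_PATTERN.findall(text)}


def grade(model_output, reference):
    """Return a reward in [0, 1]: the share of reference fields the output matches."""
    expected = parse_fields(reference)
    if not expected:
        raise ValueError("reference contains no gradable fields")
    got = parse_fields(model_output)
    hits = sum(1 for name, value in expected.items() if got.get(name) == value)
    return hits / len(expected)
```

Because the score is partial rather than pass/fail, an output with the right priority but the wrong team still earns some reward, which is the kind of signal a fixed-label example cannot express.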

Evaluating a hosted fine-tune before you ship

Validation loss going down does not mean your model is better at your task — it means it is learning the statistical patterns in your training examples. The only trustworthy evaluation is a task-specific eval set: real (or realistic) inputs your model will face in production, with expected outputs you can score programmatically or with a judge model. Before deploying any fine-tuned model, run it head-to-head against the base model and against your previous best system on that eval set. Hosted platforms give you a training loss curve; it is your job to supply the ground truth on whether the model actually improved.
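
A head-to-head comparison needs very little machinery. In the sketch below, each "model" is just a callable from prompt to output, so the same harness works for a base model, a fine-tuned model, or a previous system. The names `exact_match`, `mean_score`, and `compare` are hypothetical, and exact match is only one possible scoring rule.

```python
def exact_match(output, expected):
    """Score 1.0 when the output equals the expected answer, ignoring case and edges."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def mean_score(model_fn, eval_set, score_fn=exact_match):
    """Average score of one model over (prompt, expected) pairs."""
    scores = [score_fn(model_fn(prompt), expected) for prompt, expected in eval_set]
    return sum(scores) / len(scores)


def compare(candidates, eval_set, score_fn=exact_match):
    """Run every candidate on the same eval set and return {name: mean score}."""
    return {name: mean_score(fn, eval_set, score_fn) for name, fn in candidates.items()}
```

In practice each callable would wrap an API call that sends the production system prompt plus the eval input, and `score_fn` could be swapped for a field-level scorer or a judge model.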

FAQ

What file format do hosted fine-tuning APIs require?

All major platforms use JSONL (JSON Lines): a plain text file where each line is a valid JSON object representing one training example. For chat-based models, each line contains a messages array with role and content fields, matching the chat completions format. Never use a JSON array or CSV — the format must be one JSON object per newline, no trailing commas.

How much does hosted fine-tuning cost compared to running a GPU yourself?

Training a 500-example dataset on a small mini- or Flash-tier model costs roughly $1–5 for the training job itself. A single H100 GPU hour costs $2–4 on spot pricing. For small to medium datasets, hosted fine-tuning is dramatically cheaper because you pay only for active compute, not idle GPU time. The calculus flips at large scale or high inference volume, where dedicated hardware becomes more economical.

Can I download the weights of a model I fine-tuned on OpenAI or Google?

No. When you fine-tune on OpenAI, Vertex AI, or AWS Bedrock, you receive a model ID you can call via API, but the weights remain on the provider's infrastructure. You cannot download, port, or self-host those weights. If you need to own the weights, fine-tune an open model through Together AI or Fireworks AI, or use a self-hosted training framework.

How many examples do I need to fine-tune a hosted model?

Platforms typically require a minimum of 10 examples, but 50–100 is the practical minimum to see meaningful improvement. For production use, 500–1000 high-quality examples is a common starting point. More examples help, but quality matters more than quantity: 100 well-curated examples often outperform 1000 noisy ones.

What happens to my fine-tuned model when the provider deprecates the base model?

Your fine-tuned model is typically deprecated on the same timeline as the base model it was built on, often with 3–6 months of notice. You will need to retrain against a supported base model before the cutoff. The new training job is straightforward — reuse your original JSONL file — but you should budget time for regression-testing the new fine-tuned model before switching production traffic.

Is my training data kept private when I use a hosted fine-tuning API?

It depends on the tier and provider. OpenAI's enterprise and Azure OpenAI tiers offer contractual guarantees that training data is not used to train their base models. Standard consumer API tiers have weaker guarantees. Vertex AI and Bedrock run inside your existing cloud account, which gives security teams a cleaner compliance story. Always read the provider's data processing addendum before uploading proprietary or regulated data.
