AI/TLDR

What Is Synthetic Training Data? Using LLMs to Train LLMs

See how teams use a strong LLM to mass-produce training examples for a cheaper one — and where machine-generated data quietly goes wrong.

BEGINNER13 MIN READUPDATED 2026-06-12

In plain English

Synthetic training data is training data written by a model instead of a human. Rather than paying annotators to label thousands of examples, you ask a strong, capable LLM — called the teacher — to generate those examples on demand. The resulting dataset can then be used to fine-tune a smaller, cheaper model for a specific task.

The analogy that clicks for most people: imagine you are training a new junior employee. You could shadow senior staff for months and slowly absorb their tacit knowledge — or you could have the seniors write a comprehensive training manual overnight. Synthetic data is that manual. A senior model (strong, expensive) writes practice problems and model answers; the junior model (small, cheap) studies them and learns to behave like the senior on your specific job.

This idea sounds almost circular — using AI to teach AI — and that is exactly why it is interesting, powerful, and risky at the same time. When it works well you can train a compact model to match a frontier model on a narrow task for a fraction of the cost. When it goes wrong, errors compound through generations until the model produces homogeneous, degraded output. Understanding both sides is the whole point of this article.

Why it matters

Before synthetic generation became practical, building a fine-tuning dataset required human annotators. That meant weeks of recruitment, briefing, labeling, adjudication, and quality review — plus the cost, which for a serious dataset typically ran into tens of thousands of dollars before training even started.

A strong LLM collapses that bottleneck. You write a generation prompt once, run it overnight, and wake up to thousands of labeled examples. The speed and cost advantage is dramatic enough that synthetic data has become the default starting point for most fine-tuning projects in 2024–2025.

  • Speed. Generating 10,000 examples takes hours instead of weeks. The whole experiment cycle — generate, train, evaluate, iterate — compresses from months into days.
  • Cost. API calls to a teacher model typically cost a fraction of a cent per example. A 10,000-example dataset often costs under $50 to generate, versus thousands for professional annotation.
  • Privacy. You never hand user data to human labelers. The teacher model generates fictional but realistic inputs from scratch, keeping real data inside your environment.
  • Rare-case coverage. You can explicitly prompt for edge cases — ambiguous phrasing, adversarial inputs, low-frequency topics — that would be badly underrepresented in any naturally collected dataset.
  • Consistency. A well-prompted model applies the same rubric to every example. Human annotators introduce disagreements and fatigue-related drift that add noise to the training signal.

How it works

At its simplest, synthetic data generation is a two-step loop: prompt a strong model to produce an example, then decide whether to keep it. In practice there are four recognizable stages that every serious pipeline goes through.

Step 1: the generation prompt

The generation prompt tells the teacher what kind of example to produce: the task, the format, and — critically — an instruction to vary the output. A prompt that says "write a customer support ticket" produces near-identical tickets on every call. A prompt that says "write a support ticket from the perspective of a confused first-time user, about the following product category: {domain}" produces genuine variety. Rotating through a list of domains, personas, difficulty levels, or styles is the primary lever for dataset diversity.

Step 2: the teacher model

The teacher should be the strongest model you have access to for the task. Teacher quality is a hard ceiling on student quality — a weak teacher produces flawed examples, and the student learns those flaws. In practice, GPT-4-class models, Claude Opus-class models, or large open models like Llama 3.1 405B are common choices. The teacher only runs during dataset generation; you pay for it once.

Step 3: filtering

Raw generated examples contain noise: malformed format, near-duplicate outputs, incorrect labels, or hallucinated facts. Filtering is not optional — even 5% bad examples can noticeably degrade a fine-tune. Common filtering approaches include: running a cheaper LLM as a judge to rate each example on correctness and quality; using rule-based checks (valid JSON, minimum length, label vocabulary match); and removing near-duplicates by embedding examples and discarding ones with cosine similarity above a threshold.

Step 4: format for training

Filtered examples are converted to the instruction-response JSONL format that fine-tuning APIs expect: a system prompt, a user turn, and an assistant turn per row. The file is split into train and validation sets — typically 90/10 — and the validation set should, wherever possible, contain real human-written inputs rather than more synthetic data.

Landmark examples: where synthetic data proved itself

Synthetic training data is not theoretical. Several well-known models were built on it, and tracing those examples makes the concept concrete.

Self-Instruct and Stanford Alpaca (2022–2023)

Self-Instruct (Wang et al., 2022) was the paper that established the modern playbook. The method starts with about 175 human-written seed instruction–response pairs, then prompts a language model to generate new instructions that are dissimilar from any existing one (using a ROUGE-L similarity threshold below 0.7 as a filter). Each surviving instruction gets an input and output generated by the same model. The pool grows iteratively: new examples pass through the diversity filter and are added back as seeds for the next round. Evaluated on SuperNI and user-oriented tasks, Self-Instruct tuning improved vanilla GPT-3 by roughly 33 percentage points, approaching the performance of InstructGPT trained with human feedback — using only model-generated data.

Stanford Alpaca (2023) applied a modified version of this pipeline to generate 52,000 instruction-following examples using text-davinci-003, then fine-tuned LLaMA-7B on the result. The total data generation cost was under $500. The fine-tuned 7B model exhibited instruction-following behavior qualitatively similar to GPT-3.5, demonstrating that a small open model could be made highly capable with synthetic data alone.

Phi-1: textbook-quality synthetic code data (2023)

Microsoft's phi-1 model (1.3B parameters) pushed the idea in a different direction: instead of instruction following, it focused on generating pedagogically structured Python coding content. Approximately 1 billion tokens of synthetic textbooks and coding exercises were generated with GPT-3.5, complementing 6 billion tokens of filtered real code. The result: phi-1 scored 50.6% pass@1 on the HumanEval benchmark — competitive with models 5–10 times its size trained on raw web code. The insight was that dense, well-structured educational content transfers more learning per token than scraped, noisy data.

Magpie: synthesis from nothing (2024)

The Magpie method (accepted at ICLR 2025) discovered that instruction-tuned models generate training data spontaneously when you supply only the pre-query template — the part of the prompt that comes before the user message — and let the model fill in the rest. No seed tasks are required. A single Llama-3-Instruct model generated 4 million instruction–response pairs this way; after filtering down to 300,000 high-quality examples, fine-tuning Llama-3-8B-Base on them outperformed datasets built from human curation (ShareGPT, WildChat, UltraChat) on standard alignment benchmarks.

ProjectTeacher modelExamples generatedKey result
Self-Instruct (2022)GPT-3~52,000+33 pts on instruction-following over base GPT-3
Stanford Alpaca (2023)text-davinci-00352,0007B model matches GPT-3.5 quality for ~$500 in data costs
phi-1 (2023)GPT-3.5 / GPT-41B tokens (synthetic)1.3B model scores 50.6% on HumanEval, rivals 10B+ models
Magpie (2024–25)Llama-3-Instruct4M (300K kept)Outperforms all major human-curated open alignment datasets

The model-collapse risk

Synthetic data has a fundamental failure mode: model collapse. When a model is trained on synthetic data, then used to generate more synthetic data for the next model, errors and gaps from the first generation are baked into the second generation's training set. The second model inherits and amplifies those errors. Repeat this several times and the model's outputs become homogeneous, repetitive, and increasingly detached from the true distribution of the real world.

The canonical research demonstration comes from Oxford's Shumailov et al., published in Nature in July 2024. They showed empirically that models trained recursively on their own outputs degrade in two measurable ways: the estimated distribution mean drifts away from the true mean, and variance collapses toward zero — meaning the model loses diversity and eventually produces the same outputs regardless of input. In one illustrative test, a model asked to write about medieval architecture produced coherent text in generation one; by generation nine, it had devolved to listing jackrabbits.

What collapse looks like in practice

  • Repetitive outputs. The model gives near-identical answers regardless of how the prompt varies. Fine-tuned customer support bots that reply with the same boilerplate to every query are a common symptom.
  • Lost tail knowledge. Rare facts, minority languages, and edge-case reasoning disappear first. The model converges on the most frequent patterns in its synthetic training set and forgets everything else.
  • Compounding hallucinations. If the teacher hallucinated a fact, the student learns it as truth, and any future synthetic data generated by the student propagates the error further.
  • Calibration drift. The model's confidence no longer reflects actual accuracy. It becomes confidently wrong on topics where the synthetic data was thin.

How to avoid it

The core principle from the research is straightforward: do not train exclusively on synthetic data, and do not feed a model's own outputs back into its own training without real-data anchoring. Practically this means:

  • Always mix in real data. Research consistently shows that blending real and synthetic data outperforms either alone. Even a 10–30% real-data fraction dramatically stabilizes training and prevents distribution drift.
  • Do not iterate closed loops. If you use a fine-tuned model to generate the next round of training data, you are one step into a recursive loop. Keep the teacher fixed — use a strong, separately trained model for generation, not the model you are currently training.
  • Filter aggressively. Every round of generation should include quality filtering. Removing low-confidence examples before training slows the compounding of errors.
  • Evaluate on real human inputs. Your evaluation set must contain real user-written inputs, not more synthetic data. Synthetic evals can look great while the model silently degrades on real traffic.

Going deeper

The basics above cover the most common use case: generate instruction–response pairs for supervised fine-tuning. Here is where the field has moved in 2024–2025.

Rejection sampling for preference data

Synthetic data is not limited to supervised fine-tuning. One of its most impactful uses is generating preference data for RLHF and DPO. The technique is called rejection sampling: the teacher generates multiple candidate responses to the same prompt, a reward model scores each, and the highest-scoring responses become the "chosen" examples while lower-scoring ones become "rejected." This process generates preference pairs at scale without human labelers ranking individual outputs — it has become a standard step in post-training pipelines for frontier models.

Reasoning traces as training data

For tasks requiring multi-step reasoning — math, code, logic — the most valuable synthetic data is not just the final answer but the entire chain-of-thought trace leading to it. A strong reasoning model works through a problem step by step; the whole solution becomes the training target. The student learns to reason through problems the same way. This is a primary reason small open models have become surprisingly capable at math and coding: they were trained on worked solutions from much larger reasoning models.

Verifiable rewards: the code and math special case

The cleanest form of synthetic data quality control applies to code and math: generated outputs can be verified automatically by running a test suite or checking a numerical answer. This removes the need for an LLM judge and eliminates the risk of being fooled by plausible-sounding wrong answers. If your task has a computable ground truth — SQL queries with expected outputs, Python functions testable against unit tests, arithmetic problems with numeric answers — use execution-based filtering instead of model-based scoring. This is why code-specific models have benefited so dramatically from synthetic data: correctness is cheap to verify.

Scaling laws for synthetic data

A 2025 systematic study (Demystifying Synthetic Data in LLM Pre-training) found that mixing any category of synthetic data with real web data substantially improves performance compared to using synthetic data alone. Importantly, the optimal synthetic fraction is task-dependent: for rephrased data, 33% and 67% mixtures perform similarly; for textbook-style content, a 33% synthetic fraction significantly outperforms 67%. This means there is no universal optimal ratio — the right mix requires empirical tuning on your specific task and data types.

FAQ

What is synthetic training data in simple terms?

It is training data written by an AI model instead of a human. You give a strong LLM a description of the task and it generates thousands of labeled examples — question-and-answer pairs, classification inputs, or instruction-response pairs — that you can use to fine-tune a smaller model. The strong model is the teacher; the small model you are training is the student.

Why use synthetic data instead of collecting real data?

Collecting and labeling real data is slow, expensive, and sometimes impossible due to privacy or legal constraints. Synthetic generation is fast (overnight for tens of thousands of examples), cheap (often under $100 in API costs), and lets you control exactly what kinds of examples you want — including rare edge cases that would be underrepresented in any naturally collected dataset.

What is the Self-Instruct method?

Self-Instruct is a technique introduced in 2022 that bootstraps a large dataset from a small set of seed examples. You start with around 175 human-written instruction–response pairs, then prompt an LLM to generate new instructions that are sufficiently different from existing ones (filtered by ROUGE-L similarity). For each new instruction the model also generates the corresponding input and output. Surviving examples are added back to the seed pool and the loop repeats, growing the dataset iteratively. Stanford Alpaca used a version of this pipeline to generate 52,000 training examples for under $500.

What is model collapse and how does it happen?

Model collapse is the progressive degradation that occurs when a model is trained on synthetic data generated by a previous model, which was itself trained on synthetic data — a recursive loop. Each generation amplifies errors and reduces diversity until the model produces homogeneous, repetitive outputs. The 2024 Nature paper by Shumailov et al. demonstrated this empirically: after just a few generations of recursive synthetic training, models lost the ability to produce diverse, accurate outputs. The fix is to anchor each training round with real human-generated data and avoid closed-loop generation pipelines.

How is model collapse different from regular overfitting?

Overfitting happens when a model memorizes its training set and fails to generalize. Model collapse is a distributional problem: each generation of synthetic data is a slightly impoverished copy of the previous generation, so the training distribution itself degrades over iterations. A collapsed model has not memorized its data — it has converged on a narrow, low-diversity output mode because the data it trained on was already narrow and low-diversity.

Can I use synthetic data for evaluation as well as training?

You should not. Synthetic eval sets create an illusion of quality: a model trained on synthetic data will score well on synthetic evals even when it performs poorly on real user inputs. Always hold out a sample of real, human-written inputs for evaluation. Keep synthetic data in the training split only.

Further reading