Synthetic data
Also known as: AI-generated training data, synthetic training data
Human-labeled data is expensive, slow, and hard to scale. Synthetic data is training data generated by AI models rather than written or labeled by people: a strong frontier model produces prompt-response pairs, reasoning chains, or preference comparisons that are then used to train other models. This creates a feedback loop in which better models produce better training data, which in turn helps train better models.
Synthetic data is now central to how leading models are trained. Reasoning models in particular rely on it heavily: a typical pipeline generates thousands of chain-of-thought solutions to math and coding problems, keeps only those whose final answers check out as correct, and uses the survivors as SFT examples, while the same correctness checks can serve as reward signals for RL. The Llama 4 and DeepSeek families both lean on synthetic data pipelines extensively.
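The generate-then-filter loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not any lab's actual pipeline: `toy_generator` is a hypothetical stand-in for a model call, `verify` is a hypothetical answer checker, and the "problems" are simple additions so that correctness can be checked exactly.

```python
import random
from typing import Dict, List, Tuple

Problem = Tuple[int, int]


def toy_generator(problem: Problem, rng: random.Random) -> Tuple[str, int]:
    """Hypothetical stand-in for a model: returns (reasoning, answer).

    About 30% of the time the answer is off by one, imitating an
    imperfect generator.
    """
    a, b = problem
    answer = a + b
    if rng.random() < 0.3:
        answer += rng.choice([-1, 1])
    return f"{a} + {b} = {answer}", answer


def verify(problem: Problem, answer: int) -> bool:
    """Hypothetical verifier: checks the final answer against ground truth."""
    a, b = problem
    return answer == a + b


def build_sft_examples(
    problems: List[Problem], samples_per_problem: int, seed: int = 0
) -> List[Dict[str, str]]:
    """Sample several solutions per problem and keep only verified ones."""
    rng = random.Random(seed)
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            reasoning, answer = toy_generator(problem, rng)
            if verify(problem, answer):
                kept.append(
                    {
                        "prompt": f"What is {problem[0]} + {problem[1]}?",
                        "response": reasoning,
                    }
                )
    return kept


if __name__ == "__main__":
    examples = build_sft_examples([(1, 2), (3, 4), (10, 5)], samples_per_problem=8)
    print(f"kept {len(examples)} of 24 sampled solutions")
```

In a real pipeline the verifier would be something like a unit-test runner for code or an exact-match check on a math answer; the boolean it returns is also the kind of signal that can be used as an RL reward.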
The risks are real. If the model generating synthetic data has biases or errors, those propagate into the models trained on it. And if models are trained repeatedly on AI-generated content without grounding in new human-created data, quality and diversity can degrade over successive generations, a phenomenon sometimes called model collapse. Quality filtering, diversity controls, and mixing synthetic with human data are standard mitigations.
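The three mitigations named above can be combined in one small preprocessing step. The sketch below is illustrative only: the function name `filter_and_mix`, the `text` and `score` fields, and the default thresholds are all assumptions made for this example, and exact-match deduplication is a deliberately crude stand-in for real diversity controls.

```python
import random
from typing import Dict, List


def filter_and_mix(
    synthetic: List[Dict],
    human: List[Dict],
    min_score: float = 0.5,
    synthetic_fraction: float = 0.5,
    seed: int = 0,
) -> List[Dict]:
    """Filter synthetic examples, then blend them with human examples.

    Assumes each synthetic example has a "text" field and a "score" field
    (a hypothetical quality score in [0, 1]), and each human example has "text".
    """
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")

    # Quality filter plus a simple diversity control: drop low-scoring
    # examples and exact duplicates (after case and whitespace normalization).
    seen = set()
    kept = []
    for ex in synthetic:
        key = " ".join(ex["text"].lower().split())
        if ex["score"] >= min_score and key not in seen:
            seen.add(key)
            kept.append(ex)

    # Cap synthetic data so it forms at most `synthetic_fraction` of the mix.
    max_synthetic = int(len(human) * synthetic_fraction / (1 - synthetic_fraction))

    rng = random.Random(seed)
    rng.shuffle(kept)
    mixed = [dict(ex, source="synthetic") for ex in kept[:max_synthetic]]
    mixed += [dict(ex, source="human") for ex in human]
    rng.shuffle(mixed)
    return mixed
```

All human examples are kept and the synthetic share is capped relative to them, which reflects the idea that human data is the grounding signal; the right ratio is an empirical question and the 0.5 default here is arbitrary.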