Pretraining
Also known as: pre-training, foundation model pretraining, LLM pretraining, next-token prediction training
Pretraining is where the base capabilities of a large language model come from. The model is fed an enormous corpus of text, often hundreds of billions to trillions of tokens drawn from the web, books, code, and other sources, and trained with a simple objective: given the previous tokens in a sequence, predict the next one. With enough parameters and enough diverse data, optimizing this one objective leads the model to implicitly learn language structure, factual knowledge, and surprisingly broad reasoning abilities.
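The next-token objective described above can be sketched in a few lines. This is a minimal illustration in plain Python, not how any particular lab implements it: the function names, the four-token vocabulary, and the hand-written logits are all assumptions made for the example. In a real run the logits come from the model and the loss is averaged over very large batches.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy for predicting each token from the tokens before it.

    logits_per_position[t] is assumed to hold the model's scores over the
    vocabulary after it has read token_ids[: t + 1]; the target for that
    position is token_ids[t + 1]. Any logits past the last target are ignored.
    """
    targets = token_ids[1:]
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy sequence over a vocabulary of 4 token ids.
tokens = [0, 1, 2, 3]

# A model with no knowledge: every token equally likely at every position.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3

# A model that puts most of its score on the correct next token.
confident = [
    [0.0, 5.0, 0.0, 0.0],  # after token 0, favors token 1
    [0.0, 0.0, 5.0, 0.0],  # after token 1, favors token 2
    [0.0, 0.0, 0.0, 5.0],  # after token 2, favors token 3
]

uniform_loss = next_token_loss(uniform, tokens)
confident_loss = next_token_loss(confident, tokens)
```

The uniform model's loss equals the natural log of the vocabulary size, which is the score of pure guessing. Pretraining is the process of adjusting parameters so that this number falls across the whole corpus.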
This phase is expensive in both compute and time. Training runs for frontier models like GPT-5 or Claude Opus occupy thousands to tens of thousands of GPUs or TPUs for weeks or months. The resulting artifact is a base model, sometimes called a pretrained model or foundation model. It knows a lot about language and the world, but it has only been trained to continue text, so it does not yet reliably follow instructions or behave safely as an assistant. Those qualities come from the stages that follow, such as supervised fine-tuning and RLHF (reinforcement learning from human feedback), collectively called post-training.
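To give a feel for the compute cost mentioned above, a common rule of thumb puts training compute for a dense transformer at about 6 floating-point operations per parameter per training token. The sketch below applies that approximation. Every concrete number in it (model size, token count, accelerator count, peak throughput, utilization) is a hypothetical input chosen for illustration, not a figure for any real model or chip.

```python
def training_flops(num_params, num_tokens):
    """Approximate total training compute using the ~6 * N * D rule of thumb."""
    return 6.0 * num_params * num_tokens

def training_days(total_flops, num_accelerators, peak_flops_per_sec, utilization):
    """Wall-clock days, assuming a constant fraction of peak throughput is sustained."""
    sustained = num_accelerators * peak_flops_per_sec * utilization
    return total_flops / sustained / 86_400  # 86,400 seconds per day

# Hypothetical run: 70B parameters trained on 1.4T tokens.
flops = training_flops(num_params=70e9, num_tokens=1.4e12)

# Hypothetical cluster: 4,096 accelerators, 1e15 peak FLOP/s each, 40% utilization.
days = training_days(
    flops,
    num_accelerators=4096,
    peak_flops_per_sec=1e15,
    utilization=0.4,
)
```

The estimate ignores restarts, evaluation, and failed experiments, so real projects take longer than this arithmetic suggests. It is mainly useful for seeing how cost grows when either parameters or tokens are scaled up.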
Scaling laws, a set of empirical observations from AI research, describe how model quality improves predictably as you increase training compute, dataset size, and parameter count during pretraining: training loss tends to fall smoothly, roughly as a power law in each of these quantities. This relationship has driven much of the investment in frontier AI: labs compete to do larger pretraining runs because larger runs have so far reliably produced more capable base models. For builders, pretraining is largely invisible day-to-day, but it largely sets the ceiling of what a model can be good at, no matter how much you fine-tune it afterward.
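One widely used way to write such a scaling law expresses loss as an irreducible floor plus two power-law terms, one shrinking with parameter count N and one with training tokens D. The sketch below uses that functional form. The constants are illustrative placeholders chosen for the example, not fitted values from any study, and the function name is hypothetical.

```python
def predicted_loss(num_params, num_tokens,
                   floor=1.7, a=400.0, alpha=0.34, b=400.0, beta=0.28):
    """Loss = floor + a / N**alpha + b / D**beta.

    All default constants are placeholders for illustration only.
    """
    return floor + a / num_params ** alpha + b / num_tokens ** beta

# Scale parameters and tokens up together by 10x at each step.
runs = [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]
losses = [predicted_loss(n, d) for n, d in runs]

for (n, d), loss in zip(runs, losses):
    print(f"N={n:.0e}  D={d:.0e}  predicted loss={loss:.3f}")
```

Under this form each tenfold increase buys a smaller absolute drop in loss than the one before, and the loss never falls below the floor term. That shape is why gains from larger runs are predictable but come with diminishing returns.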