RL
Also known as: reinforcement learning, reward-based training, RL training
RL is one of the core ideas in how modern AI models are trained beyond the initial pretraining phase. Instead of just predicting the next token in a text (as in pretraining), an RL-trained model tries actions, gets a score on how good those actions were, and updates its behavior to get higher scores over time. You can think of it as training by trial and correction rather than by example.
In the context of language models, RL shows up most prominently in RLHF (Reinforcement Learning from Human Feedback), where human raters score model outputs and those scores shape the model's future behavior. It also underpins reasoning models: systems like o3 or extended-thinking models that 'think longer' on hard problems have been trained with RL to practice problem-solving strategies that actually work.
Why it matters as a builder concept: when you hear that a model is 'RL-tuned for coding' or that a lab 'scaled up RL training,' it means they've invested in teaching the model to pursue better outcomes, not just better-sounding outputs. RL is also the mechanism behind reward hacking, one of the messier failure modes to understand.