Inference-time compute
Also known as: test-time compute, TTC, inference scaling
The original recipe for making AI models more capable was to train a bigger model on more data. Inference-time compute is a different lever: let the model spend more computation on a single response at the moment it answers. It can explore multiple reasoning paths, check its own work, or sample and compare many candidate answers before settling on one.
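One simple form of "sample and compare many candidate answers" is majority voting over repeated samples. The sketch below is illustrative only: `majority_vote` and `make_noisy_solver` are hypothetical names, and the noisy solver is a stand-in for a real model call, which is assumed here to return a short final answer as a string.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples candidate answers and return the most frequent one.

    More samples means more inference-time compute spent on one question.
    Ties go to the answer that was seen first.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def make_noisy_solver(correct: str, p_correct: float, seed: int) -> Callable[[], str]:
    """Hypothetical stand-in for a model: right with probability p_correct."""
    rng = random.Random(seed)
    wrong_answers = ["41", "43", "44"]  # assumed distractors, for illustration

    def solve() -> str:
        if rng.random() < p_correct:
            return correct
        return rng.choice(wrong_answers)

    return solve


if __name__ == "__main__":
    solver = make_noisy_solver(correct="42", p_correct=0.6, seed=0)
    # One sample is the baseline; 25 samples costs 25x the compute.
    print("1 sample:  ", majority_vote(solver, 1))
    print("25 samples:", majority_vote(solver, 25))
```

Voting only helps when answers can be compared for equality and the model is right more often than it makes any single mistake. Other strategies, such as scoring candidates with a verifier, trade the same extra compute for a different selection rule.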
This shift matters because it unlocks a different scaling curve. Rather than spending more compute on training, you can let a smaller model run longer at inference and get better answers on hard problems. Builders can allocate more inference budget to complex tasks and less to simple ones, so cost tracks the difficulty of each request.
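Allocating budget by task difficulty can be sketched as a small routing function. Everything here is an assumption made for illustration: the name `thinking_budget`, the 0-to-1 difficulty score (which would have to come from some heuristic or classifier of your own), and the token limits. It is not tied to any provider's API.

```python
def thinking_budget(difficulty: float, min_tokens: int = 0, max_tokens: int = 8000) -> int:
    """Map an estimated difficulty in [0, 1] to a reasoning-token budget.

    Linear interpolation: easy requests get close to min_tokens,
    hard requests get close to max_tokens.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be between 0 and 1")
    if min_tokens > max_tokens:
        raise ValueError("min_tokens must not exceed max_tokens")
    return min_tokens + round(difficulty * (max_tokens - min_tokens))


if __name__ == "__main__":
    # Hypothetical difficulty estimates for three kinds of request.
    for label, score in [("lookup", 0.0), ("summary", 0.25), ("proof", 1.0)]:
        print(label, thinking_budget(score))
```

A linear mapping is the simplest choice. In practice the right curve depends on how quickly answer quality improves with extra budget for a given workload, which has to be measured rather than assumed.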
The concept is closely tied to reasoning models. When you turn on 'extended thinking' in Claude or call a reasoning model such as o3 through OpenAI's API, you are effectively increasing inference-time compute. The practical question for builders becomes: how much thinking time is worth buying for this particular task?