Batch inference
Also known as: batch processing, offline inference, async inference, bulk inference
Most discussions of LLM infrastructure assume real-time use: a user sends a message, the model responds in seconds. But many AI workloads are not time-sensitive. Enriching a database of customer records with AI-generated summaries, scoring thousands of support tickets for sentiment, or running an eval suite across many test cases — these all need results eventually, not immediately. Batch inference handles exactly this: you submit a large set of inputs, the system processes them when capacity is available, and you collect the results.
Batch inference providers like OpenAI's Batch API offer significant cost reductions (often 50% or more) relative to real-time API calls, in exchange for a turnaround time measured in hours rather than milliseconds. For high-volume data pipelines, this pricing difference can be the difference between a financially viable product and one where inference costs swallow the margin.
For builders, batch inference shows up most in LLMOps workflows: generating embeddings for large document sets, running quality evaluations across historical conversations, or enriching product catalog data with AI-generated descriptions. It pairs naturally with asynchronous task queues and is often where teams first realize that not every AI call needs to be real-time — and that treating them differently produces meaningful cost savings.