Concept·Infrastructure·Added 2 months ago

Batch inference

Also known as: batch processing, offline inference, async inference, bulk inference

Running many AI requests at once as a scheduled job rather than one at a time in real time. Batch inference is significantly cheaper than real-time calls and is the right choice when results do not need to be instant — think nightly data enrichment, bulk document processing, or large-scale eval runs.

Most discussions of LLM infrastructure assume real-time use: a user sends a message, the model responds in seconds. But many AI workloads are not time-sensitive. Enriching a database of customer records with AI-generated summaries, scoring thousands of support tickets for sentiment, or running an eval suite across many test cases — these all need results eventually, not immediately. Batch inference handles exactly this: you submit a large set of inputs, the system processes them when capacity is available, and you collect the results.

Batch inference providers like OpenAI's Batch API offer significant cost reductions (often 50% or more) relative to real-time API calls, in exchange for a turnaround time measured in hours rather than milliseconds. For high-volume data pipelines, this pricing difference can be the difference between a financially viable product and one where inference costs swallow the margin.

For builders, batch inference shows up most in LLMOps workflows: generating embeddings for large document sets, running quality evaluations across historical conversations, or enriching product catalog data with AI-generated descriptions. It pairs naturally with asynchronous task queues and is often where teams first realize that not every AI call needs to be real-time — and that treating them differently produces meaningful cost savings.

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.

Related terms