← Back to glossary
+Suggest a term
Concept·Business Models·Added 1 month ago

Inference cost

Also known as: inference economics, AI COGS, model serving cost, cost of inference

The compute cost your product incurs every time a model generates a response. Unlike traditional software where marginal cost is near zero, every AI query burns real GPU time and money, making cost management a core business concern.

In classic SaaS, the cost of serving one more user is nearly nothing once the software is built. AI flips that: every inference call, meaning every time your app asks the model to generate something, consumes compute that you pay for. Tokens in, tokens out, API fees, GPU time. These costs add up fast at scale, and they sit squarely in cost of goods sold (COGS), the direct costs of producing your product.

This is why AI companies typically run gross margins of 50 to 60%, well below the 80 to 90% margins traditional SaaS companies enjoy. If you price your product without accounting for inference costs carefully, you can scale to negative margins without realizing it. The math that works for ten customers may be catastrophic at a thousand.

Smart builders track inference cost per user, per feature, and per workflow from day one. Techniques like prompt caching (reusing computed results for repeated inputs), model routing (sending cheaper tasks to smaller models), and batching requests can significantly reduce inference spend. Inference costs have also dropped sharply across providers, which creates pricing pressure on raw token-based models but benefits builders who are managing their own cost stack.

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.
Related terms
Usage-based pricingAI-native SaaSModel routerPrompt cachingToken