Concept·Infrastructure·Added 2 months ago

Fine-tuned model deployment

Also known as: custom model deployment, fine-tune serving, LoRA serving, adapter serving

The infrastructure work of taking a model you have fine-tuned (trained further on your own data) and serving it in production. Fine-tuned models add complexity because you now manage both the base model weights and your custom adaptations.

Fine-tuning a model is one thing; making that fine-tuned model available to your application is another challenge. Unlike a base model you can pull from a provider's catalog, a fine-tuned model is custom — it lives as a set of modified weights (the numerical values that define the model) or adapters (small additional layers added on top of a base model, common in the LoRA fine-tuning approach). Serving it requires infrastructure that can load and host those custom weights.

For full fine-tunes (where all the model's weights are modified), deployment looks similar to self-hosting any open-weight model — you load the weights onto a GPU and serve them through a framework like vLLM. For LoRA adapters (lighter modifications that layer on top of a base model), some serving systems support running many adapters on a single base model simultaneously, which is far more efficient than running separate model instances for each variant.

Providers like Together AI, Fireworks AI, and Hugging Face Inference Endpoints support hosting fine-tuned models without full self-management. For teams building products that require behavioral customization beyond what prompting can achieve, understanding fine-tuned model deployment separates teams that can iterate rapidly from teams stuck waiting on infrastructure work.

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.

Related terms