Model Serving

The infrastructure for hosting trained models and answering prediction requests at scale. Serving systems handle load balancing, request batching, auto-scaling, and hardware allocation so that a single deployment can serve many concurrent clients efficiently. Tools such as vLLM, TGI (Text Generation Inference), and NVIDIA Triton are popular for serving large language models.
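Batching is central to serving efficiency: instead of running the model once per request, the server groups pending requests and runs the model once per group. Below is a minimal, synchronous sketch of that idea; all names (`Request`, `serve`, `model_fn`) are hypothetical, and real servers like vLLM batch asynchronously and continuously rather than over a fixed queue.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str

def batch_requests(queue, max_batch_size=8):
    """Split the pending queue into batches of at most max_batch_size."""
    return [queue[i:i + max_batch_size]
            for i in range(0, len(queue), max_batch_size)]

def serve(queue, model_fn, max_batch_size=8):
    """Run model_fn once per batch; return one output per request, in order."""
    outputs = []
    for batch in batch_requests(queue, max_batch_size):
        # One forward pass covers the whole batch, amortizing overhead.
        outputs.extend(model_fn([r.prompt for r in batch]))
    return outputs

# Toy "model": uppercase each prompt. 10 requests, batch size 4 -> 3 batches.
requests = [Request(f"q{i}") for i in range(10)]
results = serve(requests, lambda prompts: [p.upper() for p in prompts],
                max_batch_size=4)
```

A production system would add a timeout so small batches are flushed promptly, trading a little latency for throughput.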

Related terms

Inference, Latency (AI), Throughput