Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
This research explores scalable mixed-scale LLM allocation strategies that optimize model selection and GPU provisioning under strict latency and budget constraints.
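As a rough illustration of the allocation problem the abstract describes, the sketch below enumerates candidate (model, GPU count) configurations and selects the cheapest one that satisfies a latency SLO within a cost budget. All model names, latencies, and costs are invented placeholders for illustration, not values or methods from the paper.

```python
# Hypothetical catalog of model/GPU configurations; every entry is an
# illustrative assumption, not data from the paper.
# latency_ms: estimated per-request latency at the target load.
# cost_per_hr: provisioning cost for the given GPU count.
CONFIGS = [
    # (model, num_gpus, latency_ms, cost_per_hr)
    ("llm-7b",  1,  95,  2.0),
    ("llm-7b",  2,  60,  4.0),
    ("llm-13b", 2, 140,  4.0),
    ("llm-13b", 4,  85,  8.0),
    ("llm-70b", 4, 260,  8.0),
    ("llm-70b", 8, 150, 16.0),
]

def cheapest_feasible(slo_ms: float, budget_per_hr: float):
    """Return the lowest-cost (model, GPU count) configuration that meets
    the latency SLO within budget, or None if nothing qualifies."""
    feasible = [c for c in CONFIGS
                if c[2] <= slo_ms and c[3] <= budget_per_hr]
    return min(feasible, key=lambda c: c[3], default=None)

if __name__ == "__main__":
    # Example: 100 ms SLO with a $10/hr budget picks the single-GPU 7B model.
    print(cheapest_feasible(slo_ms=100, budget_per_hr=10))
```

In practice an allocator of this kind would replace the static table with measured latency profiles per model, batch size, and GPU type, but the feasibility-then-minimize-cost structure is the core of SLO-constrained provisioning.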