Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference

This research explores scalable mixed-scale LLM allocation strategies that optimize model selection and GPU provisioning under strict latency and budget constraints.

Level: advanced

By Jiaming Cheng

Category: research