Energy-Aware Scheduling for Serverless LLM Serving on Shared GPUs
The energy footprint of LLM inference in cloud clusters is growing, and serverless serving exacerbates it: elastic GPU sharing co-locates workloads with conflicting resource demands under a single device-wide operating point. Festina introduces a profiling-guided, power-aware control plane that jointly coordinates request placement, streaming multiprocessor (SM) partitioning, and GPU operating points to minimize cluster-wide energy while meeting time-to-first-token (TTFT) and time-between-tokens (TBT) SLOs. In experiments, Festina reduces energy consumption by up to 56% compared with four state-of-the-art LLM serving systems and one system augmented with dynamic voltage and frequency scaling (DVFS), while keeping SLO attainment within a 2% margin. These results show that energy-first scheduling can deliver substantial energy savings without sacrificing performance, addressing a critical need for sustainable cloud infrastructure.
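To make the idea of profiling-guided selection concrete, the following is a minimal sketch, not Festina's actual algorithm. It assumes an offline profile table that records, for each candidate SM share and core clock, the measured TTFT, TBT, and energy per request, and it picks the lowest-energy entry that still meets both SLOs. The names `ProfilePoint` and `pick_config` and all numbers are hypothetical, chosen only for illustration. The sketch also treats one model instance in isolation, so it ignores the constraint that co-located instances share a single device-wide operating point.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProfilePoint:
    """One row of a hypothetical offline profile for a model instance."""
    sm_fraction: float  # share of the GPU's SMs given to the instance
    freq_mhz: int       # GPU core clock at which the row was measured
    ttft_ms: float      # measured time to first token
    tbt_ms: float       # measured time between tokens
    energy_j: float     # measured energy per request


def pick_config(profiles: Iterable[ProfilePoint],
                ttft_slo_ms: float,
                tbt_slo_ms: float) -> Optional[ProfilePoint]:
    """Return the lowest-energy profile row that meets both SLOs, or None."""
    feasible = [p for p in profiles
                if p.ttft_ms <= ttft_slo_ms and p.tbt_ms <= tbt_slo_ms]
    return min(feasible, key=lambda p: p.energy_j, default=None)


# Illustrative numbers only; they are not measurements from the paper.
PROFILES = [
    ProfilePoint(1.00, 1980, 120.0, 20.0, 9.0),
    ProfilePoint(0.50, 1410, 210.0, 34.0, 5.5),
    ProfilePoint(0.50, 1005, 320.0, 48.0, 4.2),
    ProfilePoint(0.25, 1005, 600.0, 85.0, 3.9),
]

choice = pick_config(PROFILES, ttft_slo_ms=250.0, tbt_slo_ms=40.0)
print(choice)
```

With a 250 ms TTFT target and a 40 ms TBT target, only the first two rows are feasible, so the sketch selects the half-GPU, 1410 MHz row rather than the full-GPU, maximum-clock row. A joint scheduler would additionally have to reconcile the clock choices of all instances sharing a device and decide where to place each request.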