Energy-Aware Scheduling for Serverless LLM Serving on Shared GPUs

Tianyu Wang, Gourav Rattihalli, Aditya Dhakal, Longfei Shangguan
2026-06-30

The energy footprint of LLM inference in cloud clusters is growing, and serverless serving makes it worse: elastic GPU sharing places workloads with conflicting resource demands under a single device-wide operating point. Festina is a profiling-guided, power-aware control plane that jointly coordinates request placement, streaming multiprocessor (SM) partitioning, and GPU operating points to minimize cluster-wide energy while meeting time-to-first-token (TTFT) and time-between-tokens (TBT) SLOs. In experiments, Festina reduces energy consumption by up to 56% compared with four state-of-the-art LLM serving systems and one DVFS-augmented system, while keeping SLO attainment within a 2% margin. These results indicate that energy-first scheduling can deliver substantial energy savings without sacrificing performance, which supports more sustainable cloud infrastructure.
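To make the idea of profiling-guided selection concrete, here is a minimal sketch, not Festina's actual algorithm. It assumes a hypothetical offline profile table that records latency and energy for each (SM share, clock frequency) pair, and picks the lowest-energy entry that still satisfies both SLOs. The names `ProfileEntry` and `pick_operating_point` and every number in `PROFILE` are illustrative assumptions, and the sketch treats one model in isolation, ignoring the device-wide frequency coupling between co-located models that the abstract highlights.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProfileEntry:
    """One hypothetical profiled configuration for a single model."""
    sm_fraction: float  # share of the GPU's SMs assigned to the model
    freq_mhz: int       # GPU core clock at this operating point
    ttft_ms: float      # profiled time to first token
    tbt_ms: float       # profiled time between tokens
    energy_j: float     # profiled energy per request


def pick_operating_point(
    profile: Iterable[ProfileEntry],
    ttft_slo_ms: float,
    tbt_slo_ms: float,
) -> Optional[ProfileEntry]:
    """Return the lowest-energy entry meeting both SLOs, or None if none does."""
    feasible = [
        p for p in profile
        if p.ttft_ms <= ttft_slo_ms and p.tbt_ms <= tbt_slo_ms
    ]
    return min(feasible, key=lambda p: p.energy_j, default=None)


# Illustrative values only; these are not measurements from the paper.
PROFILE = [
    ProfileEntry(sm_fraction=1.00, freq_mhz=1980, ttft_ms=120, tbt_ms=20, energy_j=95),
    ProfileEntry(sm_fraction=0.50, freq_mhz=1980, ttft_ms=210, tbt_ms=34, energy_j=70),
    ProfileEntry(sm_fraction=0.50, freq_mhz=1200, ttft_ms=330, tbt_ms=52, energy_j=48),
    ProfileEntry(sm_fraction=0.25, freq_mhz=1200, ttft_ms=610, tbt_ms=95, energy_j=41),
]

choice = pick_operating_point(PROFILE, ttft_slo_ms=400, tbt_slo_ms=60)
print(choice)
```

With these made-up numbers, the cheapest entry overall (25% of SMs at 1200 MHz) violates both SLOs, so the sketch falls back to the next cheapest feasible one. A joint scheduler of the kind the abstract describes would additionally have to reconcile such per-model choices across models sharing a GPU and across request placement in the cluster.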
