OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters
OmniPilot addresses the problem of selecting GPU type, tensor-parallel degree, and precision for LLM serving on heterogeneous clusters, where static configurations break down because throughput fluctuates, launches do not always succeed, and hardware fails. The method combines a conformally calibrated quantile cost model covering eight serving targets with an out-of-distribution (OOD) abstention layer, and ranks candidate configurations by an economic utility metric. Across 460 benchmark runs on A100, H100, and H200 hardware, OmniPilot achieves a 6.2% mean absolute percentage error (MAPE) for throughput prediction, 95% top-1 accuracy, and a mean utility regret of 0.003, while its abstention layer correctly flags all OOD scenarios. These results support robust, adaptive LLM deployment on shared clusters, reducing wasted node-hours and improving serving reliability.
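The pipeline summarized above (calibrate quantile predictions, abstain on OOD inputs, rank by utility) can be illustrated with a minimal sketch. It is not OmniPilot's implementation. It assumes a split-conformal correction in the style of conformalized quantile regression, a hypothetical scalar `ood_score` with a fixed threshold, and a hypothetical utility defined as expected tokens per dollar. The abstract does not specify the utility metric, the OOD detector, or the calibration procedure, so every name, field, and default value below is illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """One serving configuration; every field is an illustrative assumption."""
    name: str             # label such as "H100/TP2/fp8"
    tput_lo: float        # raw lower-quantile throughput prediction, tokens/s
    p_launch: float       # predicted probability that the launch succeeds
    cost_per_hour: float  # assumed price of the configuration, $/hour
    ood_score: float      # hypothetical novelty score; larger = less familiar


def conformal_offset(y_true, q_lo, q_hi, alpha=0.1):
    """Split-conformal correction for a quantile interval (CQR-style).

    The conformity score is how far each held-out observation falls outside
    its predicted interval (negative when it falls inside). Widening the
    interval by the returned offset targets 1 - alpha marginal coverage.
    """
    scores = sorted(max(lo - y, y - hi) for y, lo, hi in zip(y_true, q_lo, q_hi))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample rank
    # With too few calibration points no finite offset gives the guarantee.
    return math.inf if k > n else scores[k - 1]


def utility(c, offset):
    """Hypothetical utility: expected tokens per dollar at the calibrated
    lower throughput bound, discounted by launch-success probability."""
    safe_tput = max(c.tput_lo - offset, 0.0)
    return c.p_launch * safe_tput * 3600.0 / c.cost_per_hour


def advise(candidates, offset, ood_threshold=1.0):
    """Drop candidates flagged as OOD, then rank the rest by utility.

    Returns (best, ranked); best is None when every candidate is flagged,
    which corresponds to abstaining instead of recommending.
    """
    in_dist = [c for c in candidates if c.ood_score <= ood_threshold]
    if not in_dist:
        return None, []
    ranked = sorted(in_dist, key=lambda c: utility(c, offset), reverse=True)
    return ranked[0], ranked


if __name__ == "__main__":
    # Toy calibration set: observed throughput vs. predicted quantile bounds.
    offset = conformal_offset(
        y_true=[100.0, 90.0, 120.0],
        q_lo=[95.0, 92.0, 100.0],
        q_hi=[110.0, 105.0, 115.0],
        alpha=0.5,
    )
    pool = [
        Candidate("A100/TP4/fp16", 1000.0, 0.90, 8.0, 0.2),
        Candidate("H100/TP2/fp8", 1800.0, 0.95, 12.0, 0.3),
        Candidate("H200/TP8/fp8", 5000.0, 0.99, 20.0, 2.5),  # flagged as OOD
    ]
    best, ranked = advise(pool, offset)
    print(best.name if best else "abstain", [c.name for c in ranked])
```

Ranking on the calibrated lower bound instead of a point estimate is one way to make the choice conservative under uncertainty. An advisor could equally rank on the median and use interval width as a tie-breaker, and the abstract does not say which approach OmniPilot takes.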