Daily | Yixun Hong

OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters

D. Balamurugan, Thomas W. Bush 2026-07-05

Inference Language Model LLM Quantization Training GPU Hardware

OmniPilot addresses the problem of choosing GPU type, tensor-parallel degree, and precision for LLM serving on heterogeneous clusters, where static configurations break down because throughput and launch success fluctuate and hardware fails. The method combines a conformally calibrated quantile cost model covering eight serving targets with an out-of-distribution (OOD) abstention layer, and ranks configurations by an economic utility metric. Across 460 benchmark runs on A100, H100, and H200 hardware, OmniPilot achieves 6.2% MAPE for throughput prediction, 95% top-1 accuracy, and 0.003 mean utility regret, while its abstention layer correctly flags all OOD scenarios. This matters because it enables robust, adaptive LLM deployment on shared clusters, reducing wasted node-hours and improving serving reliability.
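
As a rough illustration of the ideas in this summary (not the authors' implementation), the sketch below applies a split-conformal correction to predicted throughput intervals, then ranks configurations by a pessimistic throughput-per-dollar utility while abstaining on inputs flagged as OOD. The function names, the `ood_score` field, the threshold, and all numbers, including the hourly prices, are hypothetical values chosen only to make the example run.

```python
import math

def conformal_margin(cal_lo, cal_hi, cal_y, alpha=0.1):
    """Split-conformal (CQR-style) margin: widen [lo, hi] by this amount
    to target roughly (1 - alpha) coverage on exchangeable data."""
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return scores[min(k, n) - 1]          # clipped to the largest score if k > n

def rank_configs(candidates, margin, price_per_gpu_hour, ood_threshold=3.0):
    """Rank by pessimistic tokens/s per dollar-hour; skip OOD candidates."""
    ranked = []
    for c in candidates:
        if c["ood_score"] > ood_threshold:
            continue  # abstain: prediction is not trusted for this input
        pessimistic_tput = max(c["tput_lo"] - margin, 0.0)
        cost = price_per_gpu_hour[c["gpu"]] * c["tp"]
        ranked.append({**c, "utility": pessimistic_tput / cost})
    return sorted(ranked, key=lambda c: c["utility"], reverse=True)

# Toy calibration set: predicted lower/upper quantiles and observed throughput.
cal_lo = [90, 100, 110, 95, 105]
cal_hi = [110, 120, 130, 115, 125]
cal_y = [112, 118, 108, 100, 127]
margin = conformal_margin(cal_lo, cal_hi, cal_y, alpha=0.2)

# Hypothetical prices (USD per GPU-hour) and candidate configurations.
prices = {"A100": 2.0, "H100": 3.0, "H200": 4.0}
candidates = [
    {"gpu": "A100", "tp": 2, "tput_lo": 1000.0, "ood_score": 0.5},
    {"gpu": "H100", "tp": 1, "tput_lo": 900.0, "ood_score": 0.4},
    {"gpu": "H200", "tp": 1, "tput_lo": 2000.0, "ood_score": 5.0},
]
ranked = rank_configs(candidates, margin, prices)
for c in ranked:
    print(c["gpu"], c["tp"], round(c["utility"], 2))
```

Ranking on the conformally widened lower bound rather than the point estimate is one simple way to make the choice uncertainty-aware; the paper's actual utility metric may differ.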

harvard-edge/cs249r_book

harvard-edge 2026-07-05

Cache GPU Hardware Inference Quantization Scheduling Simulation Artificial

harvard-edge/cs249r_book Python

Machine Learning Systems

26,644 stars, 3,174 forks

DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint

Haotian Xie, Junlin Chen, Mingkai Zheng, Lishan Yang 2026-07-05

Language Model LLM Training Hardware Runtime

DeadPool addresses the high overhead and long recovery latency of fault tolerance in LLM training by proposing a hot-swapping mechanism that replaces failed nodes with spare nodes without terminating the job. The method uses off-critical-path in-memory checkpointing for spatial redundancy together with a communicator reconstruction protocol, and overlaps checkpointing with computation to achieve zero overhead during error-free execution. Experiments on up to 512 NVIDIA A100 GPUs with LLMs of up to 65B parameters show zero checkpoint overhead and hot-swapping recovery in under 40 seconds. This matters because it simultaneously eliminates failure-free overhead and minimizes recovery cost, enabling resilient large-scale LLM training.
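
The overlap of checkpointing with computation can be pictured with a small sketch. This is a toy, single-process analogy and not DeadPool's protocol: a background thread stands in for the transfer to a peer node's memory, a dictionary stands in for that peer's RAM, and the class and method names are hypothetical. Note that the state copy here is still taken synchronously, so this toy does not itself achieve zero overhead.

```python
import copy
import queue
import threading

class InMemoryCheckpointer:
    """Toy asynchronous checkpointer: snapshots are handed to a worker thread."""

    def __init__(self):
        self.peer_store = {}  # stands in for a peer node's memory
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            step, state = self._queue.get()
            self.peer_store["latest"] = (step, state)  # the "transfer"
            self._queue.task_done()

    def save_async(self, step, state):
        # Snapshot now, ship later; training continues right after this call.
        self._queue.put((step, copy.deepcopy(state)))

    def flush(self):
        self._queue.join()  # wait until every queued snapshot is stored

    def restore(self):
        # A spare node would pull the latest snapshot from the peer.
        return copy.deepcopy(self.peer_store["latest"])

ckpt = InMemoryCheckpointer()
state = {"w": 0.0}
for step in range(1, 6):
    state["w"] += 1.0             # stands in for one training step
    ckpt.save_async(step, state)  # transfer overlaps with the next step

ckpt.flush()
state = None                      # simulate losing the node
restored_step, restored_state = ckpt.restore()
print(restored_step, restored_state)
```

In a real system the snapshot and transfer would involve GPU memory and the network, and the communicator would need to be rebuilt around the spare node, which this sketch does not model.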

High-Performance NTT Accelerators for PQC leveraging Unified Redundant Arithmetic and Fine-Tuned Microarchitecture

George Alexakis, Dimitrios Schoinianakis, Giorgos Dimitrakopoulos 2026-07-05

Microarchitecture Simulation Accelerator

The paper addresses the performance bottleneck of modular reduction and scaling overhead in NTT/INTT accelerators for lattice-based PQC schemes like ML-KEM and ML-DSA. The authors propose parallel iterative NTT/INTT accelerators using optimized unified butterfly units with a novel redundant number representation that eliminates conditional corrections and integrates inverse-transform scaling into existing hardware. FPGA-based experimental results demonstrate higher clock frequencies, reduced execution times, and competitive resource utilization compared to prior designs. This matters because it enables more efficient polynomial arithmetic for post-quantum cryptography and privacy-preserving applications, critical for future secure communication systems.
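
For readers unfamiliar with the operations being accelerated, here is a minimal software sketch of an NTT butterfly, the modular reduction after each operation, and the inverse-transform scaling by n^{-1}. It is a plain cyclic radix-2 NTT over the ML-KEM modulus q = 3329 with fully reduced residues; it does not implement the paper's redundant number representation, nor the negacyclic, incomplete transform that ML-KEM actually specifies. The size n = 8 and the helper names are choices made for this illustration.

```python
Q = 3329                      # ML-KEM modulus
N = 8                         # small transform size for illustration
OMEGA = pow(17, 256 // N, Q)  # 17 is a primitive 256th root of unity mod Q

def ntt(a, omega, q=Q):
    """Recursive radix-2 cyclic NTT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], omega * omega % q, q)
    odd = ntt(a[1::2], omega * omega % q, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q                    # twiddle multiply + reduction
        out[k] = (even[k] + t) % q            # butterfly: sum
        out[k + n // 2] = (even[k] - t) % q   # butterfly: difference
        w = w * omega % q
    return out

def intt(values, omega, q=Q):
    """Inverse NTT: forward transform with omega^-1, then scale by n^-1."""
    n = len(values)
    a = ntt(values, pow(omega, -1, q), q)
    n_inv = pow(n, -1, q)                     # the scaling step
    return [x * n_inv % q for x in a]

def cyclic_multiply(a, b):
    """Multiply two polynomials mod (x^N - 1, Q) via pointwise products."""
    fa, fb = ntt(a, OMEGA), ntt(b, OMEGA)
    return intt([x * y % Q for x, y in zip(fa, fb)], OMEGA)

a = [1, 2, 3, 4, 0, 0, 0, 0]
b = [5, 6, 7, 0, 0, 0, 0, 0]
c = cyclic_multiply(a, b)
print(c)
```

Every `% q` above corresponds to a reduction or conditional correction in hardware, and the final multiply by `n_inv` is the separate scaling pass; these are the costs the summary says the proposed butterfly units target.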