Characterizing Software Aging in GPU-Based LLM Serving Systems

Domenico Cotroneo, Bojan Cukic 2026-06-14

This paper studies software aging in GPU-based LLM serving systems, which differ from traditional CPU-centric systems in their heterogeneous hardware and highly variable workloads. The authors run a 216-hour empirical campaign across six co-located deployments under identical stress, monitoring host, device, and client metrics and applying a statistical pipeline that accounts for autocorrelation and multiple testing. They find statistically significant memory aging in all six deployments, with leak rates that depend strongly on the serving runtime and configuration. The result is a reproducible framework that connects software aging and rejuvenation research with LLM serving and supports future mitigation strategies.


GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang 2026-06-14

Existing Diffusion Transformer (DiT) serving systems apply a static parallelism configuration to each request, which is inefficient given the heterogeneity across requests, execution stages, and system conditions. GF-DiT introduces a policy-programmable runtime that adapts parallelism dynamically, using an asynchronous execution abstraction and group-free collectives for low-overhead online GPU reallocation. Evaluated in vLLM-Omni, GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, and lowers SLO violation rates by up to 90% compared to fixed-pipeline execution. By treating GPU parallelism as a schedulable resource, it enables efficient, elastic DiT serving and improves performance and service quality for image and video generation workloads.


ReSCom: A Reconfigurable Spiking Neural Network Accelerator Using Stochastic Computing

Ali Alipour Fereidani, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

ReSCom addresses the high power and area costs of Spiking Neural Network (SNN) hardware with a reconfigurable accelerator that uses stochastic computing for multiplication while keeping addition and subtraction in exact fixed-point arithmetic. A unified neuron design supports the IF, LIF, and Synaptic models, enabling runtime trade-offs between accuracy, latency, and energy. On MNIST inference with a Xilinx Artix-7 FPGA running at 100 MHz, ReSCom achieves 92.80% accuracy at 0.05 mJ per image, outperforming recent state-of-the-art implementations in energy efficiency. The results indicate that stochastic computing can stabilize SNN inference while giving explicit, dynamic control over accuracy-latency-energy trade-offs for resource-constrained edge applications.
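For readers unfamiliar with stochastic computing, the multiplication idea can be sketched in a few lines. This is the textbook unipolar encoding, in which a value in [0, 1] becomes a random bitstream whose fraction of ones equals that value, and a bitwise AND of two independent streams estimates the product. The summary does not say which encoding or stream length ReSCom uses, so the function names, the stream length, and the seed below are illustrative assumptions and not the authors' design.

```python
import random


def to_bitstream(p, length, rng):
    """Encode a probability p in [0, 1] as a unipolar stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(a, b, length=4096, seed=0):
    """Estimate a * b by ANDing two independent stochastic bitstreams.

    Hypothetical helper for illustration only: real hardware would use
    LFSR-style generators and a counter rather than Python lists.
    """
    rng = random.Random(seed)
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    # A bit of the product stream is 1 only when both input bits are 1,
    # which happens with probability a * b for independent streams.
    product = [x & y for x, y in zip(stream_a, stream_b)]
    return sum(product) / length


estimate = sc_multiply(0.5, 0.8)
print(f"estimate of 0.5 * 0.8: {estimate:.3f}")
```

The estimate is approximate and its error shrinks as the stream gets longer. That is one way a stochastic design can trade latency against accuracy, while additions kept in fixed point contribute no sampling noise.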
