Filtered by: High × GPU × Clear all

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Lars Kopp 2026-06-14

The paper addresses the problem of high energy costs from floating-point arithmetic in deep learning hardware. It proposes a non-parametric, training-free framework using 8-bit signed integer transformation matrices and bitwise logic for dual-manifold mapping. Experimental evidence shows near-perfect reconstruction under 90% truncation sparsity and 20% random node destruction, demonstrating extreme holographic resilience. This matters because it challenges the necessity of dense, floating-point-centric GPU accelerators, enabling a shift toward low-energy neuromorphic edge-computing.

PDF

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao 2026-06-14

ForeMoE addresses expert load imbalance in Mixture-of-Experts (MoE) models during reinforcement learning (RL) post-training, where existing step-level statistics fail due to high-frequency micro-step fluctuations. The method exploits foreseeable routing information from the rollout stage to proactively guide load balancing, using a hierarchical planner to decompose the NP-hard problem and a transfer engine for overlapped expert transfer. Evaluations on 64 GPUs show up to a 1.45× speedup over state-of-the-art RL post-training systems. This matters because it enables efficient scaling of MoE LLMs under the unique workload dynamics of RL post-training, a dominant paradigm in current LLM development.

PDF

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Jinghao Wang, Xiao Zhou, Xiaoyang Sun, Yihui Zhang 2026-06-14

Maestro addresses the problem of high resource consumption and scheduling inefficiencies in deploying LLM-based multi-agent systems under strict GPU budgets. The method uses agent semantics to predict output length and memory usage, enabling hierarchical scheduling with dynamic model co-location, latency-aware routing, and workflow-aware prioritization. Experimental evidence shows Maestro reduces KV-reservation HBM by 67.2% and improves high-contention SLO attainment over EDF by 23.6 percentage points. This matters because it enables efficient, scalable deployment of complex multi-agent workflows in resource-constrained cloud environments.

PDF

Characterizing Software Aging in GPU-Based LLM Serving Systems

Domenico Cotroneo, Bojan Cukic 2026-06-14

The paper addresses the problem of software aging in GPU-based LLM serving systems, which differ from traditional CPU-centric systems due to heterogeneous hardware and highly variable workloads. The method involves a 216-hour empirical campaign across six co-located deployments with identical stress, monitoring host, device, and client metrics and applying a statistical pipeline for autocorrelation and multiple testing. Experimental evidence shows statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and configuration. This matters because it provides a reproducible framework bridging software aging and rejuvenation research with LLM serving, enabling future mitigation strategies.

PDF

nomp: A Framework for Building Domain Specific Compilers

Thilina Ratnayaka, Kaushik Kulkarni, Nipuna Fernando, Pubudu Hewavitharana 2026-06-14

Problem: Existing GPU programming models force a trade-off between low-level performance and high-level productivity, with no single solution achieving all three goals of productivity, portability, and performance. Method: The authors propose nomp, a framework for building domain-specific compilers that uses a pragma-based programming model and a runtime for code transformation and generation based on user-provided metadata. Finding or experimental evidence: The abstract does not disclose experimental results. Why it matters: nomp aims to improve programmer productivity without sacrificing performance or portability by enabling reuse of domain-specific optimization patterns.

PDF