Daily | Yixun Hong

Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads

Ranganath R. Selagamsetty, Matthew Poremba, Bradford M. Beckmann, Joshua San Miguel 2026-06-14

Gem5 Interconnect Microarchitecture Simulation HPC × Compiler Runtime × GPU

Eidola models the irregular, transient inter-GPU communication traffic of distributed AI workloads, which existing tools fail to capture because of fine-grained synchronization and peer-to-peer writes. It is a scalable gem5 extension that uses annotated timing profiles from real applications to emulate peer-to-peer GPU writes with cycle-level precision. In experiments, Eidola reproduces the variability in fused kernel execution and confirms that a SyncMon-inspired mechanism reduces polling-related memory traffic. This matters because Eidola provides a flexible platform for architectural exploration of interconnect bandwidth and latency in modern multi-GPU systems.
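
To make the polling-traffic point concrete, here is a toy model, not Eidola's or SyncMon's actual mechanism. It assumes a waiter that reads a flag every `poll_interval` cycles until a peer write lands, and compares that with a waiter that is woken once and reads the flag a single time. The function names, the arrival cycles, and the polling interval are all hypothetical.

```python
def polling_reads(arrival_cycle: int, poll_interval: int) -> int:
    """Reads issued by a waiter polling at cycles 0, P, 2P, ... (assumed model).

    The first poll at or after arrival_cycle observes the peer write,
    and every poll up to and including that one costs a memory read.
    """
    # Integer ceiling division avoids floating-point rounding.
    return -(-arrival_cycle // poll_interval) + 1


def monitored_reads() -> int:
    """Reads issued by a waiter that sleeps until notified (assumed model)."""
    return 1


# Hypothetical profile: cycles at which three peer-to-peer writes arrive.
profile = [1200, 4800, 15000]
POLL_INTERVAL = 100  # assumed polling period in cycles

total_polling = sum(polling_reads(c, POLL_INTERVAL) for c in profile)
total_monitored = sum(monitored_reads() for _ in profile)
print(f"polling reads: {total_polling}, monitored reads: {total_monitored}")
```

In this model the polling cost grows with how late the write arrives, while the notified waiter's cost stays constant, which is why irregular arrival times matter for traffic estimates.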


On the Limits of Performance Portability in Directive-Based GPU Programming

Alessandro Romeo, Nitin Shukla, Stefano Truzzi, Alessio Suriano 2026-06-14

Exascale OpenMP Performance × Portability HPC × Compiler Runtime × GPU

The problem is that directive-based GPU programming faces fundamental trade-offs between performance, portability, and productivity when scientific applications move to exascale systems. The authors ported the production-grade magnetohydrodynamics code gPLUTO from OpenACC to OpenMP and evaluated its performance on NVIDIA A100 and AMD MI250X devices. Experimental evidence shows that OpenACC and OpenMP achieve comparable performance on NVIDIA platforms, but the same OpenMP implementation is approximately three times slower at the application level on AMD MI250X, with kernel-level slowdowns of up to 47x caused by strided memory-access patterns, compiler limitations, and register pressure from C++ abstractions. This matters because it demonstrates that portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies.
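
As a rough illustration of why strided access patterns hurt, the sketch below counts how many distinct cache lines one group of GPU threads touches when thread `t` reads element `t * stride`. This is a simplified coalescing model, not a measurement from the paper: the element size, group width, and line size are assumed values, and `lines_touched` is a hypothetical helper.

```python
def lines_touched(stride_elems: int, elem_bytes: int = 8,
                  threads: int = 32, line_bytes: int = 128) -> int:
    """Count distinct cache lines read by one thread group (assumed model).

    Thread t reads the element at index t * stride_elems. Fewer distinct
    lines means the hardware can serve the group with fewer transactions.
    """
    lines = {(t * stride_elems * elem_bytes) // line_bytes
             for t in range(threads)}
    return len(lines)


# Compare unit stride with increasingly strided access for 8-byte elements.
for stride in (1, 2, 16, 32):
    print(f"stride {stride:>2}: {lines_touched(stride)} cache lines")
```

Under these assumptions, unit-stride access by 32 threads touches 2 lines, while any stride of 16 elements or more touches one line per thread. Passing `threads=64` models a wider thread group in the same way.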


nomp: A Framework for Building Domain Specific Compilers

Thilina Ratnayaka, Kaushik Kulkarni, Nipuna Fernando, Pubudu Hewavitharana 2026-06-14

OpenMP CUDA HPC × Compiler Runtime

The problem is that existing GPU programming models force a trade-off between low-level performance and high-level productivity, and no single solution achieves productivity, portability, and performance together. The authors propose nomp, a framework for building domain-specific compilers that combines a pragma-based programming model with a runtime that transforms and generates code from user-provided metadata. The abstract does not disclose experimental results. This matters because nomp aims to improve programmer productivity without sacrificing performance or portability by enabling reuse of domain-specific optimization patterns.


Work Stealing for the 2D-Mesh Topology of Satellite Constellations in Low Earth Orbit

Mia Reitz, Dorian Chenet, Jonas Posner 2026-06-14

High Performance × Computing HPC × Compiler Runtime

The problem is that existing Asynchronous Many-Task (AMT) runtimes assume a fully connected network with low, uniform latency, an assumption that does not hold for satellite constellations in Low Earth Orbit (LEO), which communicate over a sparse mesh topology. The authors propose a neighbor-only work-stealing strategy in which workers steal exclusively from directly connected neighbors to avoid multi-hop communication. Experimental evidence on an HPC cluster with an emulated mesh shows that the neighbor-only strategy performs within ~2.2% of global stealing on both balanced and irregular workloads, and an analytical model indicates a latency advantage that grows with constellation size. This matters because it demonstrates that neighbor-only stealing can match global stealing performance in emulated settings, suggesting it is a viable and potentially preferable approach for adapting AMT to Space Edge Computing (SEC) at scale.
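
The difference between the two victim-selection policies can be sketched as follows. This is a minimal illustration, not the authors' runtime: it assumes workers are numbered row by row on a `rows` x `cols` mesh without wraparound links, and `mesh_neighbors`, `hops`, and `pick_victim` are hypothetical helper names.

```python
import random


def mesh_neighbors(rank: int, rows: int, cols: int) -> list[int]:
    """Directly connected workers of `rank` on a non-wrapping 2D mesh."""
    r, c = divmod(rank, cols)
    out = []
    if r > 0:
        out.append(rank - cols)  # neighbor one row up
    if r < rows - 1:
        out.append(rank + cols)  # neighbor one row down
    if c > 0:
        out.append(rank - 1)     # neighbor to the left
    if c < cols - 1:
        out.append(rank + 1)     # neighbor to the right
    return out


def hops(a: int, b: int, cols: int) -> int:
    """Manhattan hop count between two workers on the mesh."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    return abs(ra - rb) + abs(ca - cb)


def pick_victim(rank: int, rows: int, cols: int,
                neighbor_only: bool, rng: random.Random) -> int:
    """Choose a steal victim under the neighbor-only or global policy."""
    if neighbor_only:
        candidates = mesh_neighbors(rank, rows, cols)
    else:
        candidates = [w for w in range(rows * cols) if w != rank]
    return rng.choice(candidates)


# Mean hop distance of a steal from worker 0 under each policy.
rng = random.Random(0)
for n in (4, 8, 16):
    for policy in (True, False):
        picks = [pick_victim(0, n, n, policy, rng) for _ in range(2000)]
        mean = sum(hops(0, v, n) for v in picks) / len(picks)
        label = "neighbor-only" if policy else "global"
        print(f"{n}x{n} {label}: mean hops {mean:.2f}")
```

A neighbor-only steal always costs one hop, whereas the expected hop count of a uniformly random global steal grows with the mesh dimensions, which is consistent with the summary's point about constellation size.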
