Daily | Yixun Hong

Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads

Ranganath R. Selagamsetty, Matthew Poremba, Bradford M. Beckmann, Joshua San Miguel 2026-06-14

Gem5 Interconnect Microarchitecture Simulation HPC × Compiler Runtime GPU

Eidola models the irregular, transient inter-GPU communication traffic of distributed AI workloads, which existing tools fail to capture because of fine-grained synchronization and peer-to-peer writes. It is a scalable gem5 extension that uses annotated timing profiles from real applications to emulate peer-to-peer GPU writes with cycle-level precision. In experiments, Eidola reproduces the variability in fused kernel execution and confirms that a SyncMon-inspired mechanism reduces polling-related memory traffic. The result is a flexible platform for architectural exploration of interconnect bandwidth and latency in modern multi-GPU systems.


nomp: A Framework for Building Domain Specific Compilers

Thilina Ratnayaka, Kaushik Kulkarni, Nipuna Fernando, Pubudu Hewavitharana 2026-06-14

OpenMP CUDA HPC × Compiler Runtime

Existing GPU programming models force a trade-off between low-level performance and high-level productivity, and no single model delivers productivity, portability, and performance together. The authors propose nomp, a framework for building domain-specific compilers that combines a pragma-based programming model with a runtime that transforms and generates code from user-provided metadata. The abstract does not disclose experimental results. nomp aims to improve programmer productivity without sacrificing performance or portability by enabling reuse of domain-specific optimization patterns.


Work Stealing for the 2D-Mesh Topology of Satellite Constellations in Low Earth Orbit

Mia Reitz, Dorian Chenet, Jonas Posner 2026-06-14

High Performance Computing HPC × Compiler Runtime

Existing Asynchronous Many-Task (AMT) runtimes assume a fully connected network with low, uniform latency, an assumption that does not hold for satellite constellations in Low Earth Orbit (LEO), which communicate over a sparse mesh topology. The authors propose a neighbor-only work stealing strategy in which workers steal exclusively from directly connected neighbors, avoiding multi-hop communication. On an HPC cluster with an emulated mesh, the neighbor-only strategy performs within ~2.2% of global stealing on both balanced and irregular workloads, and an analytical model indicates a latency advantage that grows with constellation size. This suggests that neighbor-only stealing is a viable and potentially preferable way to adapt AMT to Space Edge Computing (SEC) at scale.
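
To make the contrast between neighbor-only and global stealing concrete, here is a minimal Python sketch of victim selection on a 2D mesh. It is an illustration only, not the authors' implementation: the row-major rank layout, the absence of wraparound links, the use of hop count as a stand-in for latency, and every function name (`mesh_neighbors`, `hops`, `pick_victim`, `mean_global_hops`) are assumptions introduced here.

```python
import random


def mesh_neighbors(rank, rows, cols):
    """Ranks directly linked to `rank` in a rows x cols mesh.

    Assumption: row-major rank numbering and no wraparound links.
    """
    r, c = divmod(rank, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [nr * cols + nc
            for nr, nc in candidates
            if 0 <= nr < rows and 0 <= nc < cols]


def hops(a, b, cols):
    """Manhattan distance between two ranks, used as a latency proxy."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    return abs(ra - rb) + abs(ca - cb)


def pick_victim(rank, rows, cols, neighbor_only, rng=random):
    """Choose a steal victim uniformly at random.

    neighbor_only=True restricts the choice to direct neighbors (one hop);
    otherwise any other worker may be chosen, which can require many hops.
    Returns None when there is no other worker to steal from.
    """
    if neighbor_only:
        pool = mesh_neighbors(rank, rows, cols)
    else:
        pool = [w for w in range(rows * cols) if w != rank]
    return rng.choice(pool) if pool else None


def mean_global_hops(rows, cols):
    """Average hop count of a steal request under uniform global stealing."""
    n = rows * cols
    total = sum(hops(a, b, cols) for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))
```

Under these assumptions a neighbor-only steal always costs one hop, while `mean_global_hops` grows as the mesh grows, which is the intuition behind a latency advantage that increases with constellation size. The sketch says nothing about load-balancing quality, which is what the reported ~2.2% gap measures.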
