Daily | Yixun Hong

Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads

Ranganath R. Selagamsetty, Matthew Poremba, Bradford M. Beckmann, Joshua San Miguel 2026-06-14

Gem5 Interconnect Microarchitecture Simulation HPC Compiler Runtime GPU

Eidola models the irregular, transient inter-GPU communication traffic of distributed AI workloads, which existing tools fail to capture because it arises from fine-grained synchronization and peer-to-peer writes. The method is a scalable gem5 extension that uses annotated timing profiles from real applications to emulate peer-to-peer GPU writes with cycle-level precision. In experiments, Eidola reproduces the variability in fused kernel execution and confirms that a SyncMon-inspired mechanism reduces polling-related memory traffic. This matters because Eidola provides a flexible platform for architectural exploration of interconnect bandwidth and latency in modern multi-GPU systems.


Attention by Synchronization in Coupled Oscillator Networks

Fabio Pasqualetti, Taosha Guo 2026-06-14

Attention Language Model × Training Transformer Hardware Network ×

The problem is that softmax attention requires exponentiation and global reduction, which are energy-expensive on von Neumann hardware and lack a natural physical analog. The method replaces softmax with Kuramoto synchronization dynamics, where queries are fixed anchors on a sphere and free oscillators equilibrate to encode attention weights via cosine similarity. Experimental evidence shows that at oscillator dimension 2, oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and subject-verb agreement (+5.27 pp). On causal language modeling it trails softmax, but the perplexity gap narrows as dimension increases, from +11.09 to +2.98 on WikiText-2 and from +2.39 to +0.57 on TinyStories. This matters because it provides a mathematically grounded blueprint for accurate attention on energy-constrained physical substrates without requiring exponentiation or global reduction.
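For intuition about the synchronization dynamics named above, here is a minimal sketch of the textbook all-to-all Kuramoto model with identical phase oscillators. It is not the paper's attention layer: it has no fixed query anchors and produces no attention weights. The function names, coupling strength, step size, and step count are all assumptions chosen for illustration.

```python
import math
import random


def order_parameter(phases):
    """Magnitude r in [0, 1] of the mean phase vector; r near 1 means synchrony."""
    re = sum(math.cos(t) for t in phases) / len(phases)
    im = sum(math.sin(t) for t in phases) / len(phases)
    return math.hypot(re, im)


def kuramoto(phases, coupling=2.0, dt=0.05, steps=2000):
    """Euler-integrate d(theta_i)/dt = (K/N) * sum_j sin(theta_j - theta_i).

    All oscillators share the same natural frequency (taken as zero here),
    which is a simplifying assumption of this sketch.
    """
    phases = list(phases)
    n = len(phases)
    for _ in range(steps):
        velocity = [
            coupling / n * sum(math.sin(tj - ti) for tj in phases)
            for ti in phases
        ]
        phases = [t + dt * v for t, v in zip(phases, velocity)]
    return phases


if __name__ == "__main__":
    rng = random.Random(0)
    start = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(16)]
    end = kuramoto(start)
    print("r before:", order_parameter(start))
    print("r after: ", order_parameter(end))
```

At oscillator dimension 2 each oscillator reduces to a single phase on a circle, and the cosine similarity between two such oscillators is the cosine of their phase difference, which is why a phase model like this one is a natural toy setting.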


ReSCom: A Reconfigurable Spiking Neural Network Accelerator Using Stochastic Computing

Ali Alipour Fereidani, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Neural Network × Accelerator

ReSCom addresses the high power and area costs of Spiking Neural Network (SNN) hardware by introducing a reconfigurable accelerator that uses stochastic computing for multiplication while preserving exact fixed-point addition and subtraction. The method employs a unified neuron design supporting IF, LIF, and Synaptic models, enabling runtime trade-offs between accuracy, latency, and energy. On MNIST inference with a Xilinx Artix-7 FPGA, ReSCom achieves 92.80% accuracy at 0.05 mJ per image and 100 MHz, outperforming recent state-of-the-art implementations in energy efficiency. This matters because it demonstrates that stochastic computing can stabilize SNN inference while providing explicit, dynamic control over accuracy-latency-energy trade-offs for resource-constrained edge applications.
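The split between stochastic multiplication and exact addition can be illustrated with a minimal sketch of standard unipolar stochastic computing, in which a value in [0, 1] is encoded as the fraction of ones in a random bitstream and a bitwise AND of two independent streams estimates the product. This is the general technique, not ReSCom's hardware design; the function names, stream length, and software random number generator are assumptions for illustration.

```python
import random


def to_bitstream(p, length, rng):
    """Encode a probability p in [0, 1] as a unipolar stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(a, b, length=4096, seed=0):
    """Estimate a * b by ANDing two independent bitstreams and counting ones."""
    rng = random.Random(seed)
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length


def weighted_sum(inputs, weights, length=4096):
    """Multiply stochastically, then accumulate the products with exact addition."""
    total = 0.0
    for i, (x, w) in enumerate(zip(inputs, weights)):
        total += sc_multiply(x, w, length=length, seed=i)
    return total


if __name__ == "__main__":
    # Longer streams give a tighter estimate of 0.5 * 0.5 at the cost of more bits.
    for n in (64, 1024, 16384):
        print(n, sc_multiply(0.5, 0.5, length=n, seed=1))
```

The estimation error shrinks as the stream gets longer, so stream length is one knob that trades accuracy against latency and energy in a design of this kind.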


Work Stealing for the 2D-Mesh Topology of Satellite Constellations in Low Earth Orbit

Mia Reitz, Dorian Chenet, Jonas Posner 2026-06-14

High Performance Computing HPC Compiler Runtime

The problem is that existing Asynchronous Many-Task (AMT) runtimes assume a fully connected network with low, uniform latency, an assumption that does not hold for satellite constellations in Low Earth Orbit (LEO), which communicate over a sparse mesh topology. The method proposes a neighbor-only work stealing strategy in which workers steal exclusively from directly connected neighbors to avoid multi-hop communication. Experimental evidence on an HPC cluster with an emulated mesh shows the neighbor-only strategy performs within ~2.2% of global stealing on both balanced and irregular workloads, and an analytical model indicates a latency advantage that grows with constellation size. This matters because it demonstrates that neighbor-only stealing can closely match global stealing performance in emulated settings, suggesting it is a viable and potentially preferable approach for adapting AMT to Space Edge Computing (SEC) at scale.
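The difference between the two victim-selection policies can be sketched as follows. This is a minimal illustration, not the authors' runtime: it assumes workers are numbered row by row in a rows x cols mesh with no wraparound links, and the function names are hypothetical.

```python
import random


def mesh_neighbors(rank, rows, cols):
    """Ranks directly linked to `rank` in a rows x cols mesh (no wraparound assumed)."""
    r, c = divmod(rank, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [nr * cols + nc for nr, nc in candidates
            if 0 <= nr < rows and 0 <= nc < cols]


def hop_distance(a, b, cols):
    """Number of mesh hops between two ranks (Manhattan distance)."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    return abs(ra - rb) + abs(ca - cb)


def pick_victim(rank, rows, cols, rng, neighbor_only=True):
    """Choose a steal victim: a direct neighbor, or any other worker if global."""
    if neighbor_only:
        pool = mesh_neighbors(rank, rows, cols)
    else:
        pool = [w for w in range(rows * cols) if w != rank]
    return rng.choice(pool)


if __name__ == "__main__":
    rng = random.Random(0)
    rows, cols, thief = 8, 8, 0
    for policy in (True, False):
        hops = [hop_distance(thief, pick_victim(thief, rows, cols, rng, policy), cols)
                for _ in range(1000)]
        print("neighbor-only" if policy else "global", sum(hops) / len(hops))
```

Under the neighbor-only policy every steal attempt crosses exactly one link, while a global steal from a corner of a larger mesh can cross many, which is the intuition behind a latency advantage that grows with constellation size.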
