Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads

Ranganath R. Selagamsetty, Matthew Poremba, Bradford M. Beckmann, Joshua San Miguel 2026-06-14

As distributed AI workloads grow in scale, multi-GPU systems have become essential for training large models. Although techniques like kernel fusion and overlapping communication with computation help reduce delays, they also introduce irregular and transient traffic patterns that are difficult to model using existing tools. These techniques rely heavily on fine-grained synchronization and peer-to-peer communication, which place significant pressure on interconnect bandwidth and latency. In this work, we introduce Eidola, a scalable extension to the gem5 simulation framework that enables detailed modeling of inter-GPU communication traffic. The extension is scalable as our GPU model serves as a succinct eidolon, emulating the minimal characteristics needed for traffic modeling. Eidola uses annotated timing profiles from real applications to emulate peer-to-peer GPU writes with cycle-level precision. This allows researchers to simulate and analyze synchronization behavior across large multi-GPU configurations. The simulator supports configurable per-GPU traffic patterns and enables isolated performance analysis under different communication scenarios. We demonstrate Eidola's effectiveness by reproducing variability in fused kernel execution and by implementing a SyncMon-inspired synchronization mechanism, confirming reductions in polling-related memory traffic. Our results show that Eidola provides a flexible and scalable platform for studying inter-GPU communication and supports architectural exploration in modern distributed GPU systems.

PDF

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao 2026-06-14

Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signal that reshapes subsequent exploration, and expanding as prior successes shift the bottleneck distribution. We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation -- a checks-and-balances architecture where neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes to multiple generations of hardware platform, and run-to-run variance is within 2 percentage points demonstrating that the method is hardware-agnostic and reproducible.

PDF

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Jinghao Wang, Xiao Zhou, Xiaoyang Sun, Yihui Zhang 2026-06-14

Large Language Model based Multi-Agent Systems (LLM-MAS) have emerged as a powerful paradigm for tackling complex tasks by breaking them into collaborative workflows of specialized LLM-powered agents. However, deploying such multi-agent workloads at scale poses significant system challenges. Each user query spawns an iterative pipeline of LLM calls, greatly amplifying resource consumption compared to single-turn queries. In resource-constrained cloud settings, these workflows face non-deterministic and input-dependent costs at decode stage, heavy-tailed multi-model requirements with memory fragmentation and over-provisioning, and cross-cluster scheduling trade-offs. We present Maestro, a workload-aware scheduling system designed for LLM-MAS serving under strict GPU budgets. Maestro explicitly leverages agent semantics and roles: it predicts the output length and memory usage of each stage and uses this prediction to drive a hierarchical scheduler. At the node level, Maestro enables dynamic multi-model co-location via hierarchical weight caching and elastic memory provisioning. At the cluster level, it performs latency-aware routing to avoid cold-start delays and memory overloads. At the global level, it enforces workflow-aware prioritization to minimize head-of-line blocking for interactive tasks. Across prototype experiments and trace-driven simulations, Maestro reduces KV-reservation HBM by 67.2% and improves high-contention SLO attainment over EDF by 23.6 percentage points.

PDF

LMCache/LMCache

LMCache 2026-06-14 8,985 Python
LMCache/LMCache Python
LMCache: Supercharge Your LLM with the Fastest KV Cache Layer
8,985 1,311

Problem: Researchers need reusable code for CUDA, ROCm, AMD, and Inference. Method: LMCache/LMCache packages Python implementation artifacts around CUDA, ROCm, AMD, and Inference, so it should be assessed through its code, examples, and linked papers. Finding: The available metadata points to CUDA, ROCm, AMD, and Inference as the main relevance signal. Why it matters: It gives the recommendation profile concrete repository feedback for future papers, tools, and architecture experiments.

Code

MiniMax Sparse Attention

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen 2026-06-14

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.

PDF

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin 2026-06-14

Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow and higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47\%. Our models are available at https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12.

PDF

nomp: A Framework for Building Domain Specific Compilers

Thilina Ratnayaka, Kaushik Kulkarni, Nipuna Fernando, Pubudu Hewavitharana 2026-06-14

The low-level GPU programming models (CUDA, HIP, OpenCL, etc.) provide detailed control of the data flow and execution plan of a program in order to extract close-to-metal performance. However, these have a steep learning curve due to the intricacies of their syntax and semantics. This reduces programmer productivity. On the other hand, high-level models (OpenMP, OpenACC, etc.) that serve as abstractions over the low-level models are aimed at improving programmer productivity but achieving performance on-par with the low-level models is a challenge. There are inherent trade-offs between productivity, portability and performance in both approaches and there is no one-size-fits-all solution which achieves all three simultaneously. However, we believe there is room to improve programmer productivity without sacrificing performance and portability by reusing optimization patterns specific to a given domain. To this end, we propose nomp: a framework for building domain specific compilers. nomp consists of a pragma based programming model and a runtime capable of code transformation and generation based on user provided metadata.

PDF

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim 2026-06-14

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose \textbf{ReSET}, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-$M$ NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to $\sim\!$2 points over the NVFP4 baseline. Our CUDA-core small-$M$ kernel further improves latency-critical decoding, delivering up to $2.5\!\times$ kernel-level speedup over NVFP4 vLLM and approximately $2\!\times$ end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.

PDF

Partitioned Tags, Shared Data: Reconciling Strict Cache Isolation with Write-Shared Coherence

Kartik Ramkrishnan, Stephen McCamant, Antonia Zhai, Pen Chung Yew 2026-06-14

Cache partitioning is among the strongest structural defenses against eviction-based cache side channels, yet a decade-old design issue has blocked its widespread deployment in secure shared-OS settings. The issue is that write-shared coherence collapses under strict partitioning. We present SCP (Secure and Coherent Partitioning), which combines strict eviction isolation with write-shared coherence by partitioning only the tags, sharing a single data pool, and sizing the data pool so capacity-driven cross-partition eviction cannot occur. Timing obfuscation extends protections to the inter-partition lookup path. Coherence-based leakage on shared-writeable lines is mitigated by routing those writes through to the LLC once a leakage threshold is crossed, which makes attacker write probe latency independent of victim activity. Using gem5 for implementation, SCP mitigates Prime+Probe and Flush+Reload, which are the basis for more sophisticated cache attacks. We also demonstrate that a shared-writeable-line attack is mitigated. All these attacks yield results no better than random guessing. SCP's hardware cost is a modest +2.8% LLC SRAM. Performance matches DAWG within 0.3% IPC on the SPEC CPU2017 benchmarks that we evaluated. Sharing-intensive microbenchmarks demonstrate a tunable security-performance tradeoff based on a system-specified leakage threshold.

PDF

SupraSNN: Exploiting Synapse-Level Parallelism in Spiking Neural Network Accelerators through Co-Optimized Mapping and Scheduling

Seyed Sadra Ghavami, Mohammad Hossein Nikkhah, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Spiking Neural Networks (SNNs) offer a brain-inspired path toward highly efficient computation, but their practical deployment is constrained by the challenge of managing and executing their massive parallelism on physical hardware. This problem mirrors the historical challenge in processor design of moving beyond serial execution, a barrier broken by superscalar architectures that dispatch multiple instructions to parallel functional units. Drawing inspiration from this paradigm, we introduce a hardware-software co-design framework that treats synaptic events as parallelizable micro-operations. We present SupraSNN, a superscalar-inspired architecture that achieves high synapse-level parallelism by physically decoupling synaptic and neuronal computations. Within this architecture, a Multi-Cast Tree routes spike data to multiple parallel Synapse Processing Units serve as the computational pipelines, while a Merge Tree consolidates distributed results for processing by a unified Neuron Unit--deliberately centralizing complex neuron state dynamics to mitigate hardware overhead and resource duplication. The efficacy of this architecture is enabled by a sophisticated partitioning and scheduling framework that first maps the SNN onto hardware respecting memory constraints, then heuristic scheduling determines the synaptic execution order, maximizing throughput and resource utilization. Implementing a feedforward SNN trained on MNIST (93.44% accuracy), SupraSNN achieves 149 $μs$ inference latency and 0.025 mJ per image (0.276 nJ per synapse) on the Xilinx Zynq XC7Z020 FPGA--delivering 47.6% lower latency and 5.6$\times$ better energy efficiency than prior FPGA-based SNN accelerators. Beyond vision tasks, a recurrent SNN on the Spiking Heidelberg Dataset (71.82% accuracy) achieves 1.41 ms latency and 0.77 mJ per sample on XC7Z030.

PDF

Characterizing Software Aging in GPU-Based LLM Serving Systems

Domenico Cotroneo, Bojan Cukic 2026-06-14

This paper proposes an empirical methodology to study software aging in GPU-based LLM serving systems. Traditional aging studies focus on CPU-centric software with relatively regular workloads; LLM serving is different, spanning a Python host and a CUDA device, handling requests whose cost varies by orders of magnitude, and relying on rapidly evolving software stacks. We run a 216-hour campaign across six co-located deployments under identical stress conditions, monitor host, device, and client metrics in parallel, and apply a statistical pipeline that accounts for autocorrelation and multiple testing. Our results reveal statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and deployment configuration. Beyond these findings, we provide a reproducible framework that opens a research direction at the intersection of the software aging and rejuvenation and LLM serving communities.

PDF

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao 2026-06-14

Mixture-of-Experts (MoE) and reinforcement learning (RL) post-training now dominate large language model (LLM) development, yet expert load imbalance remains a critical challenge. Existing load-balancing systems target pre-training by relying on historical step-level statistics. However, these methods fail under the unique workload dynamics of RL post-training: the step-level load is stable, but the tiny batch sizes processed during micro-steps cause severe, high-frequency load fluctuations. We introduce ForeMoE, a micro-step-level load balancing system for MoE RL post-training. Instead of relying on historical statistics, ForeMoE exploits the multi-stage RL pipeline (rollout, recompute, policy update) by using foreseeable routing information from the rollout stage to proactively guide load balancing in the remaining stages. To support frequent per-micro-step reconfiguration, ForeMoE employs a hierarchical planner that decomposes the NP-hard load balancing problem into tractable sub-components, alongside a transfer engine that leverages complementary hardware paths (CPU-assisted and GPU-direct) for overlapped expert transfer. Evaluations on 64 GPUs demonstrate that ForeMoE achieves up to a 1.45$\times$ speedup over state-of-the-art RL post-training systems.

PDF

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang 2026-06-14

Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01$\times$, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 $μ$s.

PDF

ReSCom: A Reconfigurable Spiking Neural Network Accelerator Using Stochastic Computing

Ali Alipour Fereidani, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Spiking Neural Networks (SNNs) provide an attractive framework for energy-efficient inference due to their event-driven computation and biologically inspired dynamics. However, efficient hardware realization of SNNs remains challenging because neuronal computations incur significant power and area costs, and uncontrolled approximate arithmetic can destabilize recurrent state updates when precision is not properly managed. To address these challenges, this paper presents ReSCom, a reconfigurable SNN accelerator that leverages stochastic computing to reduce hardware complexity while maintaining stable inference. The proposed architecture employs stochastic arithmetic for multiplication operations in neuron dynamics, while preserving exact fixed-point addition/subtraction operations. This stochastic strategy enables runtime trade-offs between accuracy, latency, and energy consumption. A unified reconfigurable neuron design supports Integrate-and-Fire (IF), Leaky Integrate-and-Fire (LIF), and Synaptic neuron models within a single hardware framework. Experimental results for MNIST inference on a Xilinx Artix-7 FPGA show that ReSCom achieves $92.80\%$ classification accuracy while consuming just $0.05~\mathrm{mJ}$ of operational energy per image at $100~\mathrm{MHz}$, outperforming the energy efficiency of recent state-of-the-art implementations. Furthermore, managing the stochastic bit-stream length allows explicit, dynamic control over accuracy-latency-energy trade-offs to meet target application constraints.

PDF

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Lars Kopp 2026-06-14

Modern deep learning hardware paradigms rely heavily on computationally expensive floating-point arithmetic (FP32, FP16, and FP8), requiring massive thermal and energetic overheads to maintain gradient-based optimization. This paper introduces a non-parametric, training-free computational framework for dual-manifold mapping that operates strictly within an 8-bit signed integer boundary and leverages simple bitwise and accumulation logic. By mapping a Spatial Manifold (N_spatial = 8192 neurons) and a Gabor-pooled Structural Manifold (N_structural = 4096 neurons) through an integer-based transformation matrix (Z-matrix), we eliminate the need for floating-point multipliers. Inference is achieved via cache-friendly pointer offsets and bitwise masks, accumulating directional sign-charges using fixed thresholds (theta_reject = 8.0, theta_cut = 2.0). Learning is executed through a localized, bounded update mechanism restricted strictly within [-127, 127], modulated by stochastic noise injection. Both architectures demonstrate extreme holographic resilience, preserving near-perfect reconstruction via a global scaling factor under 90% truncation sparsity and 20% random node destruction. By reducing core AI inference to 8-bit boundaries and boolean-like execution, this framework outlines a paradigm shift toward neuromorphic edge-computing, directly questioning the long-term necessity of dense, floating-point-centric GPU accelerators.

PDF

On the Limits of Performance Portability in Directive-Based GPU Programming

Alessandro Romeo, Nitin Shukla, Stefano Truzzi, Alessio Suriano 2026-06-14

The transition of scientific applications to GPU-accelerated exascale systems is constrained by trade-offs between performance, portability, and productivity. This work evaluates the performance portability of directive-based GPU programming by porting gPLUTO, a production-grade magnetohydrodynamics code for astrophysical simulations, from OpenACC to OpenMP, and analyzing its performance on NVIDIA A100 (Leonardo Booster) and AMD MI250X (LUMI-G) devices. On NVIDIA platforms, OpenACC and OpenMP achieve comparable performance due to a shared compiler backend, providing a consistent baseline for assessing algorithmic efficiency. In contrast, the same OpenMP implementation is approximately three times slower at the application level on AMD MI250X with respect to the NVIDIA A100 OpenACC baseline, with kernel-level slowdowns reaching up to an order of magnitude, driven by sensitivity to strided memory-access patterns and compiler limitations. Kernel-level profiling shows that the dominant contributors to run-time are memory-latency-bound rather than limited by peak band-width. In low-parallelism kernels, C++ abstraction layers increase register pressure and spilling, leading to extreme slowdowns of up to 47x in specific cases. These results indicate that portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies

PDF