Daily | Yixun Hong

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao 2026-06-14

Search Cognition Autonomy Agents

Arbor addresses the problem of autonomous optimization in large, stateful action spaces by introducing a multi-agent framework with structured tree search as a shared cognition layer. The method pairs an Orchestrator agent with a Critic agent in a checks-and-balances architecture, using an explicit search tree of scored hypotheses as working memory. Experimental evidence shows Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% and crashes within hours. This matters because it enables fully autonomous, hardware-agnostic, and reproducible multi-day optimization campaigns across the full LLM inference stack.

PDF

Eidola: Modeling Multi-GPU Network Communication Traffic in Distributed AI Workloads

Ranganath R. Selagamsetty, Matthew Poremba, Bradford M. Beckmann, Joshua San Miguel 2026-06-14

Gem5 Interconnect Microarchitecture Simulation HPC Compiler Runtime GPU

Eidola addresses the problem of modeling irregular and transient inter-GPU communication traffic in distributed AI workloads, which existing tools fail to capture due to fine-grained synchronization and peer-to-peer writes. The method introduces a scalable gem5 extension that uses annotated timing profiles from real applications to emulate peer-to-peer GPU writes with cycle-level precision. Experimental evidence demonstrates Eidola's effectiveness by reproducing variability in fused kernel execution and confirming reductions in polling-related memory traffic via a SyncMon-inspired mechanism. This matters because Eidola provides a flexible platform for architectural exploration of interconnect bandwidth and latency in modern multi-GPU systems.

PDF

Partitioned Tags, Shared Data: Reconciling Strict Cache Isolation with Write-Shared Coherence

Kartik Ramkrishnan, Stephen McCamant, Antonia Zhai, Pen Chung Yew 2026-06-14

Gem5 Microarchitecture Simulation Cache

SCP solves the problem that write-shared coherence fails under strict cache partitioning, a decade-old barrier to deploying eviction-based side-channel defenses in secure shared-OS settings. The method partitions only the tags while sharing a single data pool, sizes the data pool to prevent capacity-driven cross-partition eviction, and routes writes to the LLC after a leakage threshold to mitigate coherence-based leakage. Experimental evidence from gem5 shows SCP mitigates Prime+Probe, Flush+Reload, and shared-writeable-line attacks to no better than random guessing, with a +2.8% LLC SRAM hardware cost and IPC within 0.3% of DAWG on SPEC CPU2017. This matters because SCP reconciles strict cache isolation with write-shared coherence, enabling secure partitioning without sacrificing performance or coherence correctness.

PDF

ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

Inference

ITME addresses the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond individual server capacity. The method leverages CXL-hybrid memory to provide massive, byte-addressable remote memory expansion, simplifying the software stack by eliminating complex software-level optimization. Experimental evidence from production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, along with an FPGA prototype, shows up to a 35.7% throughput improvement over conventional CPU-offloading. This matters because ITME enables cost-efficient scaling of shared context layers for agentic and long-context LLMs by proactively managing data movement across the memory-storage hierarchy.

PDF

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin 2026-06-14

LLM Design Data

Problem: Automated testbench generation is a bottleneck in LLM-driven RTL workflows due to stochastic, costly, and low-coverage outputs from prompt-based methods. Method: STG (Structured Testbench Generation) exploits hardware design structure to produce deterministic testbenches. Finding: STG runs 720x faster than iterative LLM-based flows, achieves higher coverage, reduces false-pass verdicts, and is 11x faster and 127x more energy-efficient than LLM-based filtering on a single CPU core. Why it matters: STG enables rapid, reliable verification for LLM-driven design, improves RTL benchmarks by exposing faulty testbenches, and yields state-of-the-art distilled models with reduced node count.

PDF

On the Limits of Performance Portability in Directive-Based GPU Programming

Alessandro Romeo, Nitin Shukla, Stefano Truzzi, Alessio Suriano 2026-06-14

Exascale OpenMP Performance × Portability HPC Compiler Runtime GPU

The problem is that directive-based GPU programming faces fundamental trade-offs between performance, portability, and productivity when transitioning scientific applications to exascale systems. The method involved porting the production-grade magnetohydrodynamics code gPLUTO from OpenACC to OpenMP and evaluating its performance on NVIDIA A100 and AMD MI250X devices. Experimental evidence shows that while OpenACC and OpenMP achieve comparable performance on NVIDIA platforms, the same OpenMP implementation is approximately three times slower at the application level on AMD MI250X, with kernel-level slowdowns reaching up to 47x due to strided memory-access patterns, compiler limitations, and register pressure from C++ abstractions. This matters because it demonstrates that achieving portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies.

PDF

nomp: A Framework for Building Domain Specific Compilers

Thilina Ratnayaka, Kaushik Kulkarni, Nipuna Fernando, Pubudu Hewavitharana 2026-06-14

OpenMP CUDA HPC Compiler Runtime

Problem: Existing GPU programming models force a trade-off between low-level performance and high-level productivity, with no single solution achieving all three goals of productivity, portability, and performance. Method: The authors propose nomp, a framework for building domain-specific compilers that uses a pragma-based programming model and a runtime for code transformation and generation based on user-provided metadata. Finding or experimental evidence: The abstract does not disclose experimental results. Why it matters: nomp aims to improve programmer productivity without sacrificing performance or portability by enabling reuse of domain-specific optimization patterns.

PDF

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang 2026-06-14

Transformer GPU Runtime Scheduling Parallelism Serving

The problem is that existing Diffusion Transformer (DiT) serving systems use static parallelism for each request, which is inefficient due to heterogeneity across requests, execution stages, and system conditions. GF-DiT introduces a policy-programmable runtime that dynamically adapts parallelism via an asynchronous execution abstraction and group-free collectives for low-overhead online GPU reallocation. Experimental evaluation in vLLM-Omni shows GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, and lowers SLO violation rates by up to 90% compared to fixed-pipeline execution. This matters because it enables efficient, elastic DiT serving that treats GPU parallelism as a schedulable resource, significantly improving performance and service quality for image and video generation workloads.

PDF

Work Stealing for the 2D-Mesh Topology of Satellite Constellations in Low Earth Orbit

Mia Reitz, Dorian Chenet, Jonas Posner 2026-06-14

High Performance × Computing HPC Compiler Runtime

The problem is that existing Asynchronous Many-Task (AMT) runtimes assume a fully connected network with low, uniform latency, which is invalid for satellite constellations in Low Earth Orbit (LEO) that communicate via a sparse mesh topology. The method proposes a neighbor-only work stealing strategy where workers steal exclusively from directly connected neighbors to avoid multi-hop communication. Experimental evidence on an HPC cluster with an emulated mesh shows the neighbor-only strategy performs within ~2.2% of global stealing on both balanced and irregular workloads, and an analytical model indicates a growing latency advantage with constellation size. This matters because it demonstrates that neighbor-only stealing can match global stealing performance in emulated settings, suggesting it is a viable and potentially preferable approach for adapting AMT to Space Edge Computing (SEC) at scale.

PDF