Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao 2026-06-14

Arbor addresses autonomous optimization in large, stateful action spaces by introducing a multi-agent framework with structured tree search as a shared cognition layer. The method pairs an Orchestrator agent with a Critic agent in a checks-and-balances architecture, using an explicit search tree of scored hypotheses as working memory. In experiments, Arbor achieves a throughput-latency Pareto improvement of up to 193% for inference over vendor-optimized baselines, whereas a single agent without the harness plateaus at +33% and crashes within hours. This matters because it enables fully autonomous, hardware-agnostic, and reproducible multi-day optimization campaigns across the full LLM inference stack.
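
As a rough illustration of a search tree of scored hypotheses used as working memory, the sketch below runs a best-first search in Python. It is not Arbor's implementation: the `Node` class, the `propose` callback (standing in for the Orchestrator), the `critique` callback (standing in for the Critic), and the best-first expansion policy are all assumptions made for this example.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One scored hypothesis in the search tree (hypothetical structure)."""
    hypothesis: str
    score: float
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def best_first_search(
    root_hypothesis: str,
    propose: Callable[[str], List[str]],   # assumed role: Orchestrator
    critique: Callable[[str], float],      # assumed role: Critic
    budget: int,
):
    """Expand the highest-scoring open hypothesis up to `budget` times."""
    root = Node(root_hypothesis, critique(root_hypothesis))
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # so that Node objects are never compared directly.
    frontier = [(-root.score, 0, root)]
    counter = 1
    best = root
    for _ in range(budget):
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        for hypothesis in propose(node.hypothesis):
            child = Node(hypothesis, critique(hypothesis), parent=node)
            node.children.append(child)
            if child.score > best.score:
                best = child
            heapq.heappush(frontier, (-child.score, counter, child))
            counter += 1
    return root, best


def path_to_root(node: Node) -> List[str]:
    """Return the chain of hypotheses from the root down to `node`."""
    chain = []
    while node is not None:
        chain.append(node.hypothesis)
        node = node.parent
    return chain[::-1]
```

Because every node keeps its parent and children, the whole tree stays inspectable after the run, which is the property the summary attributes to an explicit tree as opposed to an unstructured agent history.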

Non-Parametric Dual-Manifold Mapping via 8-Bit Bounded Transformation Matrices: Challenging FP-centric Hardware Paradigms in Low-Energy AI

Lars Kopp 2026-06-14

The paper addresses the problem of high energy costs from floating-point arithmetic in deep learning hardware. It proposes a non-parametric, training-free framework using 8-bit signed integer transformation matrices and bitwise logic for dual-manifold mapping. Experimental evidence shows near-perfect reconstruction under 90% truncation sparsity and 20% random node destruction, demonstrating extreme holographic resilience. This matters because it challenges the necessity of dense, floating-point-centric GPU accelerators, enabling a shift toward low-energy neuromorphic edge-computing.

Partitioned Tags, Shared Data: Reconciling Strict Cache Isolation with Write-Shared Coherence

Kartik Ramkrishnan, Stephen McCamant, Antonia Zhai, Pen Chung Yew 2026-06-14

SCP solves the problem that write-shared coherence fails under strict cache partitioning, a decade-old barrier to deploying eviction-based side-channel defenses in secure shared-OS settings. The method partitions only the tags while sharing a single data pool, sizes the data pool to prevent capacity-driven cross-partition eviction, and routes writes to the LLC after a leakage threshold to mitigate coherence-based leakage. Experimental evidence from gem5 shows SCP mitigates Prime+Probe, Flush+Reload, and shared-writeable-line attacks to no better than random guessing, with a +2.8% LLC SRAM hardware cost and IPC within 0.3% of DAWG on SPEC CPU2017. This matters because SCP reconciles strict cache isolation with write-shared coherence, enabling secure partitioning without sacrificing performance or coherence correctness.

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim 2026-06-14

Problem: Large reasoning models (LRMs) incur high inference costs due to long reasoning traces, and directly applying NVFP4 low-precision quantization degrades reasoning accuracy while existing kernels fail to deliver latency benefits in small-batch autoregressive decoding. Method: ReSET proposes a step-aware temperature scaling method that estimates step-level uncertainty online using both token-level and step-level entropy signals, and introduces a CUDA-core small-M NVFP4 kernel for latency-critical decoding. Finding: ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline, and the custom kernel achieves up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Why it matters: This work enables accurate and efficient low-precision inference for latency-critical LRM deployments, reducing computational and memory costs without sacrificing reasoning quality.
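
The two primitives named in the method, temperature scaling and entropy, can be sketched in a few lines of Python. This is only an illustration of those building blocks: the summary does not state how ReSET maps entropy to a temperature, so no such rule is shown, and the function names and the choice of averaging token entropies into a step-level value are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(token_logits: Sequence[Sequence[float]],
                 temperature: float = 1.0) -> float:
    """Mean token-level entropy over one reasoning step (assumed aggregation)."""
    values = [entropy(softmax(logits, temperature)) for logits in token_logits]
    return sum(values) / len(values)
```

Lowering the temperature sharpens the distribution and so reduces its entropy, which is why an uncertainty estimate and a temperature setting are natural to couple.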

ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

ITME addresses the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond individual server capacity. The method leverages CXL-hybrid memory to provide massive, byte-addressable remote memory expansion, simplifying the software stack by eliminating complex software-level optimization. Experimental evidence from production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, along with an FPGA prototype, shows up to a 35.7% throughput improvement over conventional CPU-offloading. This matters because ITME enables cost-efficient scaling of shared context layers for agentic and long-context LLMs by proactively managing data movement across the memory-storage hierarchy.

Harnessing Routing Foresight for Micro-step-level MoE Load Balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao 2026-06-14

ForeMoE addresses expert load imbalance in Mixture-of-Experts (MoE) models during reinforcement learning (RL) post-training, where existing step-level statistics fail due to high-frequency micro-step fluctuations. The method exploits foreseeable routing information from the rollout stage to proactively guide load balancing, using a hierarchical planner to decompose the NP-hard problem and a transfer engine for overlapped expert transfer. Evaluations on 64 GPUs show up to a 1.45× speedup over state-of-the-art RL post-training systems. This matters because it enables efficient scaling of MoE LLMs under the unique workload dynamics of RL post-training, a dominant paradigm in current LLM development.
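
To make the idea of planning from foreseen routing concrete, the sketch below places experts on GPUs once per-expert token counts are known in advance. It uses the standard greedy longest-processing-time heuristic as a stand-in; it is not ForeMoE's hierarchical planner, and the function name, the input format, and the absence of a per-GPU expert cap are assumptions.

```python
import heapq
from typing import Dict, List, Tuple


def lpt_assign(expert_tokens: Dict[str, int],
               num_gpus: int) -> Tuple[Dict[int, List[str]], Dict[int, int]]:
    """Greedy placement: heaviest expert first, onto the least-loaded GPU."""
    # Min-heap of (current token load, gpu id).
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    placement: Dict[int, List[str]] = {gpu: [] for gpu in range(num_gpus)}
    # Token counts are assumed known ahead of time (e.g. from rollout).
    for expert, tokens in sorted(expert_tokens.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (load + tokens, gpu))
    loads = {gpu: sum(expert_tokens[e] for e in experts)
             for gpu, experts in placement.items()}
    return placement, loads
```

The heuristic is fast but not optimal: with token counts 90, 60, 50, 40, 30, 30 on two GPUs it yields loads of 160 and 140, while a 150/150 split exists. That gap is one reason a dedicated planner is worth building for the underlying NP-hard problem.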

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin 2026-06-14

Problem: Automated testbench generation is a bottleneck in LLM-driven RTL workflows due to stochastic, costly, and low-coverage outputs from prompt-based methods. Method: STG (Structured Testbench Generation) exploits hardware design structure to produce deterministic testbenches. Finding: STG runs 720× faster than iterative LLM-based flows, achieves higher coverage, reduces false-pass verdicts, and is 11× faster and 127× more energy-efficient than LLM-based filtering on a single CPU core. Why it matters: STG enables rapid, reliable verification for LLM-driven design, improves RTL benchmarks by exposing faulty testbenches, and yields state-of-the-art distilled models with reduced node count.

SupraSNN: Exploiting Synapse-Level Parallelism in Spiking Neural Network Accelerators through Co-Optimized Mapping and Scheduling

Seyed Sadra Ghavami, Mohammad Hossein Nikkhah, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Deploying Spiking Neural Networks (SNNs) on hardware is limited by the difficulty of managing massive parallelism, which the paper likens to the historical barrier of serial execution in processors. SupraSNN is a superscalar-inspired hardware-software co-design framework that treats synaptic events as parallelizable micro-operations, using a Multi-Cast Tree, parallel Synapse Processing Units, and a Merge Tree with a unified Neuron Unit. On a Xilinx Zynq XC7Z020 FPGA, SupraSNN achieves 149 μs inference latency and 0.025 mJ per image for MNIST (93.44% accuracy), delivering 47.6% lower latency and 5.6× better energy efficiency than prior FPGA-based SNN accelerators. This matters because it demonstrates a practical path to high synapse-level parallelism and energy efficiency for SNN deployment, extending to recurrent SNNs on the Spiking Heidelberg Dataset.

Characterizing Software Aging in GPU-Based LLM Serving Systems

Domenico Cotroneo, Bojan Cukic 2026-06-14

The paper addresses software aging in GPU-based LLM serving systems, which differ from traditional CPU-centric systems due to heterogeneous hardware and highly variable workloads. The method is a 216-hour empirical campaign across six co-located deployments under identical stress, monitoring host, device, and client metrics and applying a statistical pipeline that accounts for autocorrelation and multiple testing. The results show statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and configuration. This matters because it provides a reproducible framework bridging software aging and rejuvenation research with LLM serving, enabling future mitigation strategies.
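
When many metrics are tested for a trend across several deployments, some correction for multiple testing is needed before calling any single result significant. The summary does not name the correction used, so the sketch below shows the Benjamini-Hochberg step-up procedure purely as one common choice; the function name and the example p-values are illustrative.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float],
                       alpha: float = 0.05) -> List[bool]:
    """Return, per hypothesis, whether it is rejected at FDR level `alpha`."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    cutoff = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[index] = True
    return rejected
```

The procedure is step-up: for p-values 0.02, 0.03, 0.035, 0.9 the first two miss their own thresholds, but the third passes its threshold of 0.0375, so all three are rejected together.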

An LLM System for Autonomous Variational Quantum Circuit Design

Kenya Sakka, Wataru Mizukami, Kosuke Mitarai 2026-06-14

The problem is that designing high-performing quantum circuits remains heavily reliant on human expertise. The method introduces an autonomous agentic framework using LLMs with seven integrated components for iterative circuit design under explicit constraints. Experimental evidence shows the framework outperforms representative quantum feature maps on image classification and achieves competitive accuracy for molecular ground state estimation across seven molecules. This matters because it establishes LLM-driven agentic systems as a viable paradigm for automated quantum circuit design and demonstrates AI's role in iterative scientific optimization.

Attention by Synchronization in Coupled Oscillator Networks

Fabio Pasqualetti, Taosha Guo 2026-06-14

The problem is that softmax attention requires exponentiation and global reduction, which are energy-expensive on von Neumann hardware and lack a natural physical analog. The method replaces softmax with Kuramoto synchronization dynamics, where queries are fixed anchors on a sphere and free oscillators equilibrate to encode attention weights via cosine similarity. Experimental evidence shows that at oscillator dimension 2, oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and subject-verb agreement (+5.27 pp), while on causal language modeling it closes the perplexity gap as dimension increases, from +11.09 to +2.98 on WikiText-2 and from +2.39 to +0.57 on TinyStories. This matters because it provides a mathematically grounded blueprint for accurate attention on energy-constrained physical substrates without requiring exponentiation or global reduction.

ReSCom: A Reconfigurable Spiking Neural Network Accelerator Using Stochastic Computing

Ali Alipour Fereidani, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

ReSCom addresses the high power and area costs of Spiking Neural Network (SNN) hardware by introducing a reconfigurable accelerator that uses stochastic computing for multiplication while preserving exact fixed-point addition and subtraction. The method employs a unified neuron design supporting IF, LIF, and Synaptic models, enabling runtime trade-offs between accuracy, latency, and energy. On MNIST inference with a Xilinx Artix-7 FPGA, ReSCom achieves 92.80% accuracy at 0.05 mJ per image and 100 MHz, outperforming recent state-of-the-art implementations in energy efficiency. This matters because it demonstrates that stochastic computing can stabilize SNN inference while providing explicit, dynamic control over accuracy-latency-energy trade-offs for resource-constrained edge applications.
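
Stochastic-computing multiplication can be illustrated with a short software model. In the unipolar encoding, a value p in [0, 1] is represented by a bitstream whose bits are 1 with probability p, and the bitwise AND of two independent streams has a 1-density close to the product. Whether ReSCom uses this encoding is not stated in the summary, so the encoding, the stream length, and the function names below are assumptions.

```python
import random
from typing import List


def encode(p: float, length: int, rng: random.Random) -> List[int]:
    """Unipolar stochastic bitstream: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(a: float, b: float, length: int = 4096, seed: int = 0) -> float:
    """Estimate a * b for a, b in [0, 1] by AND-ing two independent streams."""
    rng = random.Random(seed)
    stream_a = encode(a, length, rng)
    stream_b = encode(b, length, rng)
    # In hardware this is a single AND gate per bit followed by a counter.
    ones = sum(x & y for x, y in zip(stream_a, stream_b))
    return ones / length
```

The estimate is noisy, with error shrinking roughly as the inverse square root of the stream length. That is the accuracy-latency trade-off in miniature: longer streams cost more cycles but give a closer product.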

Specifying Hardware Communication as Programs

Ernest Ng, Nikil Shyamsunder, Francis Pham, Adrian Sampson 2026-06-14

The problem is that hardware testing requires separate driver and monitor programs for each protocol, leading to manual effort and inconsistency risks. The method proposes a DSL that specifies hardware communication protocols as succinct imperative programs, enabling a single specification to both drive and monitor transactions. The abstract does not disclose experimental results, but describes a tool that automatically infers transaction-level traces from waveforms using the DSL specification. This matters because it could eliminate redundant code and reduce bugs in hardware verification for protocols like Wishbone and AXI-Stream.
