ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

ITME targets the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond the capacity of a single server. It uses CXL-hybrid memory to provide large, byte-addressable remote memory expansion, which simplifies the software stack by removing the need for complex software-level optimization. Experiments on production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, together with an FPGA prototype, show up to a 35.7% throughput improvement over conventional CPU offloading. ITME thereby enables cost-efficient scaling of shared context layers for agentic and long-context LLMs by proactively managing data movement across the memory-storage hierarchy.


SupraSNN: Exploiting Synapse-Level Parallelism in Spiking Neural Network Accelerators through Co-Optimized Mapping and Scheduling

Seyed Sadra Ghavami, Mohammad Hossein Nikkhah, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Deploying Spiking Neural Networks (SNNs) on hardware is limited by the difficulty of managing massive parallelism, a challenge analogous to the historical barrier of serial execution in processors. SupraSNN is a superscalar-inspired hardware-software co-design framework that treats synaptic events as parallelizable micro-operations, built from a Multi-Cast Tree, parallel Synapse Processing Units, and a Merge Tree with a unified Neuron Unit. On a Xilinx Zynq XC7Z020 FPGA, SupraSNN achieves 149 μs inference latency and 0.025 mJ per image on MNIST (93.44% accuracy), which is 47.6% lower latency and 5.6× better energy efficiency than prior FPGA-based SNN accelerators. The result demonstrates a practical path to high synapse-level parallelism and energy efficiency for SNN deployment, and the approach extends to recurrent SNNs on the Spiking Heidelberg Dataset.
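
For a rough sense of scale, the reported SupraSNN figures can be turned into a few derived quantities. The sketch below uses only the numbers quoted in the summary. The interpretations are assumptions made here, not statements from the paper: that the energy figure covers exactly one inference window, that "47.6% lower" means the new latency is (1 - 0.476) times the baseline, and that "5.6× better energy efficiency" means 5.6× less energy per image. The variable names are illustrative.

```python
# Figures quoted in the summary above.
latency_s = 149e-6         # 149 us inference latency
energy_j = 0.025e-3        # 0.025 mJ per image
latency_reduction = 0.476  # 47.6% lower latency than prior work
energy_gain = 5.6          # 5.6x better energy efficiency

# Assumption: the energy figure is spent over exactly one inference window,
# so energy / latency gives the average power during inference (about 0.17 W).
avg_power_w = energy_j / latency_s

# Assumption: "X% lower" means new = baseline * (1 - X).
implied_baseline_latency_s = latency_s / (1.0 - latency_reduction)

# Assumption: "5.6x better efficiency" means 5.6x less energy per image.
implied_baseline_energy_j = energy_j * energy_gain

# Assumption: inferences run back to back with no pipelining or batching.
images_per_second = 1.0 / latency_s

print(f"average power:            {avg_power_w * 1e3:.1f} mW")
print(f"implied baseline latency: {implied_baseline_latency_s * 1e6:.1f} us")
print(f"implied baseline energy:  {implied_baseline_energy_j * 1e3:.3f} mJ")
print(f"sequential throughput:    {images_per_second:.0f} images/s")
```

The implied baseline values are back-of-the-envelope estimates. The latency and energy comparisons in the paper may refer to different prior accelerators, so the two baselines should not be read as describing a single design.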
