Daily | Yixun Hong

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim 2026-06-14

Inference × Quantization Hardware ×

Problem: Large reasoning models (LRMs) incur high inference costs due to long reasoning traces, and directly applying NVFP4 low-precision quantization degrades reasoning accuracy while existing kernels fail to deliver latency benefits in small-batch autoregressive decoding. Method: ReSET proposes a step-aware temperature scaling method that estimates step-level uncertainty online using both token-level and step-level entropy signals, and introduces a CUDA-core small-M NVFP4 kernel for latency-critical decoding. Finding: ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline, and the custom kernel achieves up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Why it matters: This work enables accurate and efficient low-precision inference for latency-critical LRM deployments, reducing computational and memory costs without sacrificing reasoning quality.

PDF

ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

Inference ×

ITME addresses the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond individual server capacity. The method leverages CXL-hybrid memory to provide massive, byte-addressable remote memory expansion, simplifying the software stack by eliminating complex software-level optimization. Experimental evidence from production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, along with an FPGA prototype, shows up to a 35.7% throughput improvement over conventional CPU-offloading. This matters because ITME enables cost-efficient scaling of shared context layers for agentic and long-context LLMs by proactively managing data movement across the memory-storage hierarchy.

PDF

SupraSNN: Exploiting Synapse-Level Parallelism in Spiking Neural Network Accelerators through Co-Optimized Mapping and Scheduling

Seyed Sadra Ghavami, Mohammad Hossein Nikkhah, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Parallelism Neural Network Accelerator Scheduling

The problem is that deploying Spiking Neural Networks (SNNs) on hardware is limited by the challenge of managing massive parallelism, analogous to the historical barrier of serial execution in processors. The method introduces SupraSNN, a superscalar-inspired hardware-software co-design framework that treats synaptic events as parallelizable micro-operations, using a Multi-Cast Tree, parallel Synapse Processing Units, and a Merge Tree with a unified Neuron Unit. Experimental evidence shows that on a Xilinx Zynq XC7Z020 FPGA, SupraSNN achieves 149 μs inference latency and 0.025 mJ per image for MNIST (93.44% accuracy), delivering 47.6% lower latency and 5.6× better energy efficiency than prior FPGA-based SNN accelerators. This matters because it demonstrates a practical path to high synapse-level parallelism and energy efficiency for SNN deployment, extending to recurrent SNNs on the Spiking Heidelberg Dataset.

PDF

ReSCom: A Reconfigurable Spiking Neural Network Accelerator Using Stochastic Computing

Ali Alipour Fereidani, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Neural Network Accelerator

ReSCom addresses the high power and area costs of Spiking Neural Network (SNN) hardware by introducing a reconfigurable accelerator that uses stochastic computing for multiplication while preserving exact fixed-point addition and subtraction. The method employs a unified neuron design supporting IF, LIF, and Synaptic models, enabling runtime trade-offs between accuracy, latency, and energy. On MNIST inference with a Xilinx Artix-7 FPGA, ReSCom achieves 92.80% accuracy at 0.05 mJ per image and 100 MHz, outperforming recent state-of-the-art implementations in energy efficiency. This matters because it demonstrates that stochastic computing can stabilize SNN inference while providing explicit, dynamic control over accuracy-latency-energy trade-offs for resource-constrained edge applications.

PDF