Daily | Yixun Hong

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim 2026-06-14

Inference Quantization Hardware

Problem: Large reasoning models (LRMs) incur high inference costs due to long reasoning traces, and directly applying NVFP4 low-precision quantization degrades reasoning accuracy while existing kernels fail to deliver latency benefits in small-batch autoregressive decoding. Method: ReSET proposes a step-aware temperature scaling method that estimates step-level uncertainty online using both token-level and step-level entropy signals, and introduces a CUDA-core small-M NVFP4 kernel for latency-critical decoding. Finding: ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline, and the custom kernel achieves up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Why it matters: This work enables accurate and efficient low-precision inference for latency-critical LRM deployments, reducing computational and memory costs without sacrificing reasoning quality.
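As a rough illustration of entropy-driven temperature control, the sketch below averages token-level entropy over one reasoning step and maps it to a sampling temperature. The mapping rule, its direction (cooler sampling when a step looks uncertain), and the names `step_temperature`, `base_temp`, `min_temp`, and `sensitivity` are all assumptions made for this example; the summary does not specify ReSET's actual formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def step_temperature(step_logits, base_temp=1.0, min_temp=0.5, sensitivity=0.5):
    """Hypothetical rule: lower the temperature when a step looks uncertain.

    step_logits holds one logit vector per token generated in the step.
    All parameter names and defaults are illustrative assumptions.
    """
    max_entropy = math.log(len(step_logits[0]))  # entropy of a uniform distribution
    mean_entropy = sum(token_entropy(l) for l in step_logits) / len(step_logits)
    normalized = mean_entropy / max_entropy      # roughly in [0, 1]
    return max(min_temp, base_temp * (1.0 - sensitivity * normalized))


confident_step = [[100.0, 0.0, 0.0, 0.0]] * 3   # sharply peaked distributions
uncertain_step = [[0.0, 0.0, 0.0, 0.0]] * 3     # uniform distributions
print(step_temperature(confident_step), step_temperature(uncertain_step))
```

Under these assumed defaults, a confident step keeps the base temperature while a maximally uncertain step is clamped to `min_temp`.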

ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

Inference

ITME addresses the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond individual server capacity. The method leverages CXL-hybrid memory to provide massive, byte-addressable remote memory expansion, simplifying the software stack by eliminating complex software-level optimization. Experimental evidence from production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, along with an FPGA prototype, shows up to a 35.7% throughput improvement over conventional CPU-offloading. This matters because ITME enables cost-efficient scaling of shared context layers for agentic and long-context LLMs by proactively managing data movement across the memory-storage hierarchy.

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao 2026-06-14

Language Model LLM Training GPU Hardware

ForeMoE addresses expert load imbalance in Mixture-of-Experts (MoE) models during reinforcement learning (RL) post-training, where existing step-level statistics fail due to high-frequency micro-step fluctuations. The method exploits foreseeable routing information from the rollout stage to proactively guide load balancing, using a hierarchical planner to decompose the NP-hard problem and a transfer engine for overlapped expert transfer. Evaluations on 64 GPUs show up to a 1.45× speedup over state-of-the-art RL post-training systems. This matters because it enables efficient scaling of MoE LLMs under the unique workload dynamics of RL post-training, a dominant paradigm in current LLM development.
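To make the balancing problem concrete, the sketch below places experts on GPUs with a greedy longest-processing-time heuristic, using per-expert token counts that are assumed to be known in advance (for example, from routing recorded during rollout). This is a generic textbook heuristic for an NP-hard partitioning problem, not ForeMoE's hierarchical planner; the function name and the input format are hypothetical.

```python
import heapq


def greedy_expert_placement(expert_loads, num_gpus):
    """Assign each expert to the currently least-loaded GPU, heaviest first.

    expert_loads maps an expert id to its predicted token count for one
    micro-step (an assumed input for this illustration).
    """
    heap = [(0, gpu) for gpu in range(num_gpus)]  # (current load, gpu id)
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}

    ordered = sorted(expert_loads.items(), key=lambda kv: kv[1], reverse=True)
    for expert, load in ordered:
        total, gpu = heapq.heappop(heap)          # least-loaded GPU so far
        placement[gpu].append(expert)
        heapq.heappush(heap, (total + load, gpu))

    gpu_totals = {
        gpu: sum(expert_loads[e] for e in experts)
        for gpu, experts in placement.items()
    }
    return placement, gpu_totals


loads = {"e0": 8, "e1": 7, "e2": 6, "e3": 5, "e4": 4}
placement, totals = greedy_expert_placement(loads, num_gpus=2)
print(placement, totals)
```

The greedy result here (17 versus 13 tokens) is not optimal, since a 15/15 split exists, which hints at why a dedicated planner is needed when loads shift at every micro-step.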

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin 2026-06-14

LLM Design Data

Problem: Automated testbench generation is a bottleneck in LLM-driven RTL workflows due to stochastic, costly, and low-coverage outputs from prompt-based methods. Method: STG (Structured Testbench Generation) exploits hardware design structure to produce deterministic testbenches. Finding: STG runs 720x faster than iterative LLM-based flows, achieves higher coverage, reduces false-pass verdicts, and is 11x faster and 127x more energy-efficient than LLM-based filtering on a single CPU core. Why it matters: STG enables rapid, reliable verification for LLM-driven design, improves RTL benchmarks by exposing faulty testbenches, and yields state-of-the-art distilled models with reduced node count.

An LLM System for Autonomous Variational Quantum Circuit Design

Kenya Sakka, Wataru Mizukami, Kosuke Mitarai 2026-06-14

Language Model LLM Machine Learning Hardware Quant Ph

The problem is that designing high-performing quantum circuits remains heavily reliant on human expertise. The method introduces an autonomous agentic framework using LLMs with seven integrated components for iterative circuit design under explicit constraints. Experimental evidence shows the framework outperforms representative quantum feature maps on image classification and achieves competitive accuracy for molecular ground state estimation across seven molecules. This matters because it establishes LLM-driven agentic systems as a viable paradigm for automated quantum circuit design and demonstrates AI's role in iterative scientific optimization.

Attention by Synchronization in Coupled Oscillator Networks

Fabio Pasqualetti, Taosha Guo 2026-06-14

Attention Language Model Training Transformer Hardware Network

The problem is that softmax attention requires exponentiation and global reduction, which are energy-expensive on von Neumann hardware and lack a natural physical analog. The method replaces softmax with Kuramoto synchronization dynamics, where queries are fixed anchors on a sphere and free oscillators equilibrate to encode attention weights via cosine similarity. Experimental evidence shows that at oscillator dimension 2, oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and subject-verb agreement (+5.27 pp), while on causal language modeling it closes the perplexity gap as dimension increases, from +11.09 to +2.98 on WikiText-2 and from +2.39 to +0.57 on TinyStories. This matters because it provides a mathematically grounded blueprint for accurate attention on energy-constrained physical substrates without requiring exponentiation or global reduction.
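For readers unfamiliar with Kuramoto dynamics, the sketch below simulates the classical phase-only model, dθ_i/dt = ω_i + (K/N) Σ_j sin(θ_j − θ_i), with forward Euler steps and reports the order parameter r (r near 1 means the oscillators have synchronized). This is the textbook model only; the paper's sphere-valued oscillators, fixed query anchors, and read-out of attention weights are not reproduced here, and the step size and coupling values are arbitrary choices for the example.

```python
import cmath
import math


def kuramoto_step(phases, natural_freqs, coupling, dt):
    """One forward-Euler step of the classical Kuramoto model."""
    n = len(phases)
    updated = []
    for i in range(n):
        pull = sum(math.sin(phases[j] - phases[i]) for j in range(n))
        dtheta = natural_freqs[i] + (coupling / n) * pull
        updated.append(phases[i] + dt * dtheta)
    return updated


def order_parameter(phases):
    """Magnitude of the mean phasor: 1.0 means perfectly synchronized."""
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))


def simulate(phases, natural_freqs, coupling=1.0, dt=0.05, steps=2000):
    """Integrate the dynamics for a fixed number of Euler steps."""
    for _ in range(steps):
        phases = kuramoto_step(phases, natural_freqs, coupling, dt)
    return phases


initial = [0.0, 0.5, 1.0, 1.5]   # phases in radians
freqs = [0.0] * len(initial)     # identical natural frequencies
final = simulate(initial, freqs)
print(order_parameter(initial), order_parameter(final))
```

With identical natural frequencies and positive coupling, the phases collapse to a common value, which is the equilibration behaviour the summary refers to.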

ReSCom: A Reconfigurable Spiking Neural Network Accelerator Using Stochastic Computing

Ali Alipour Fereidani, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Neural Network Accelerator

ReSCom addresses the high power and area costs of Spiking Neural Network (SNN) hardware by introducing a reconfigurable accelerator that uses stochastic computing for multiplication while preserving exact fixed-point addition and subtraction. The method employs a unified neuron design supporting IF, LIF, and Synaptic models, enabling runtime trade-offs between accuracy, latency, and energy. On MNIST inference with a Xilinx Artix-7 FPGA, ReSCom achieves 92.80% accuracy at 0.05 mJ per image and 100 MHz, outperforming recent state-of-the-art implementations in energy efficiency. This matters because it demonstrates that stochastic computing can stabilize SNN inference while providing explicit, dynamic control over accuracy-latency-energy trade-offs for resource-constrained edge applications.
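The core idea of stochastic-computing multiplication can be sketched in a few lines: encode each operand in [0, 1] as a random bitstream whose fraction of ones equals the value, AND the two streams bit by bit, and count the ones. The code below is a software illustration of this standard unipolar scheme, not ReSCom's hardware design; the stream length, the seed, and the function names are assumptions.

```python
import random


def to_bitstream(value, length, rng):
    """Encode a value in [0, 1] as a unipolar stochastic bitstream."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    return [1 if rng.random() < value else 0 for _ in range(length)]


def sc_multiply(a, b, length=4096, seed=0):
    """Estimate a * b by AND-ing two independently generated bitstreams."""
    rng = random.Random(seed)                 # seeded for reproducibility
    stream_a = to_bitstream(a, length, rng)
    stream_b = to_bitstream(b, length, rng)
    product_stream = [x & y for x, y in zip(stream_a, stream_b)]
    return sum(product_stream) / length       # fraction of ones ~ a * b


print(sc_multiply(0.5, 0.5))   # close to 0.25, with sampling noise
```

Longer streams reduce the estimation noise at the cost of latency, which is one way an accuracy-latency-energy trade-off of the kind described above can arise.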
