Daily | Yixun Hong

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim 2026-06-14

Inference Quantization Hardware

Problem: Large reasoning models (LRMs) incur high inference costs due to long reasoning traces, and directly applying NVFP4 low-precision quantization degrades reasoning accuracy while existing kernels fail to deliver latency benefits in small-batch autoregressive decoding. Method: ReSET proposes a step-aware temperature scaling method that estimates step-level uncertainty online using both token-level and step-level entropy signals, and introduces a CUDA-core small-M NVFP4 kernel for latency-critical decoding. Finding: ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline, and the custom kernel achieves up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Why it matters: This work enables accurate and efficient low-precision inference for latency-critical LRM deployments, reducing computational and memory costs without sacrificing reasoning quality.

PDF

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin 2026-06-14

LLM Design × Data

Problem: Automated testbench generation is a bottleneck in LLM-driven RTL workflows due to stochastic, costly, and low-coverage outputs from prompt-based methods. Method: STG (Structured Testbench Generation) exploits hardware design structure to produce deterministic testbenches. Finding: STG runs 720x faster than iterative LLM-based flows, achieves higher coverage, reduces false-pass verdicts, and is 11x faster and 127x more energy-efficient than LLM-based filtering on a single CPU core. Why it matters: STG enables rapid, reliable verification for LLM-driven design, improves RTL benchmarks by exposing faulty testbenches, and yields state-of-the-art distilled models with reduced node count.

PDF

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Jinghao Wang, Xiao Zhou, Xiaoyang Sun, Yihui Zhang 2026-06-14

Workload Scheduling LLM Agent

Maestro addresses the problem of high resource consumption and scheduling inefficiencies in deploying LLM-based multi-agent systems under strict GPU budgets. The method uses agent semantics to predict output length and memory usage, enabling hierarchical scheduling with dynamic model co-location, latency-aware routing, and workflow-aware prioritization. Experimental evidence shows Maestro reduces KV-reservation HBM by 67.2% and improves high-contention SLO attainment over EDF by 23.6 percentage points. This matters because it enables efficient, scalable deployment of complex multi-agent workflows in resource-constrained cloud environments.

PDF

SupraSNN: Exploiting Synapse-Level Parallelism in Spiking Neural Network Accelerators through Co-Optimized Mapping and Scheduling

Seyed Sadra Ghavami, Mohammad Hossein Nikkhah, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Parallelism Neural Network Accelerator Scheduling

The problem is that deploying Spiking Neural Networks (SNNs) on hardware is limited by the challenge of managing massive parallelism, analogous to the historical barrier of serial execution in processors. The method introduces SupraSNN, a superscalar-inspired hardware-software co-design framework that treats synaptic events as parallelizable micro-operations, using a Multi-Cast Tree, parallel Synapse Processing Units, and a Merge Tree with a unified Neuron Unit. Experimental evidence shows that on a Xilinx Zynq XC7Z020 FPGA, SupraSNN achieves 149 μs inference latency and 0.025 mJ per image for MNIST (93.44% accuracy), delivering 47.6% lower latency and 5.6× better energy efficiency than prior FPGA-based SNN accelerators. This matters because it demonstrates a practical path to high synapse-level parallelism and energy efficiency for SNN deployment, extending to recurrent SNNs on the Spiking Heidelberg Dataset.

PDF

An LLM System for Autonomous Variational Quantum Circuit Design

Kenya Sakka, Wataru Mizukami, Kosuke Mitarai 2026-06-14

Language Model LLM Machine Learning Hardware Quant Ph

The problem is that designing high-performing quantum circuits remains heavily reliant on human expertise. The method introduces an autonomous agentic framework using LLMs with seven integrated components for iterative circuit design under explicit constraints. Experimental evidence shows the framework outperforms representative quantum feature maps on image classification and achieves competitive accuracy for molecular ground state estimation across seven molecules. This matters because it establishes LLM-driven agentic systems as a viable paradigm for automated quantum circuit design and demonstrates AI's role in iterative scientific optimization.

PDF

ReSCom: A Reconfigurable Spiking Neural Network Accelerator Using Stochastic Computing

Ali Alipour Fereidani, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Neural Network Accelerator

ReSCom addresses the high power and area costs of Spiking Neural Network (SNN) hardware by introducing a reconfigurable accelerator that uses stochastic computing for multiplication while preserving exact fixed-point addition and subtraction. The method employs a unified neuron design supporting IF, LIF, and Synaptic models, enabling runtime trade-offs between accuracy, latency, and energy. On MNIST inference with a Xilinx Artix-7 FPGA, ReSCom achieves 92.80% accuracy at 0.05 mJ per image and 100 MHz, outperforming recent state-of-the-art implementations in energy efficiency. This matters because it demonstrates that stochastic computing can stabilize SNN inference while providing explicit, dynamic control over accuracy-latency-energy trade-offs for resource-constrained edge applications.

PDF

Specifying Hardware Communication as Programs

Ernest Ng, Nikil Shyamsunder, Francis Pham, Adrian Sampson 2026-06-14

Interconnect HPC Compiler Runtime Hardware Communication

The problem is that hardware testing requires separate driver and monitor programs for each protocol, leading to manual effort and inconsistency risks. The method proposes a DSL that specifies hardware communication protocols as succinct imperative programs, enabling a single specification to both drive and monitor transactions. The abstract does not disclose experimental results, but describes a tool that automatically infers transaction-level traces from waveforms using the DSL specification. This matters because it could eliminate redundant code and reduce bugs in hardware verification for protocols like Wishbone and AXI-Stream.

PDF