Daily | Yixun Hong

WIDE: Boosting Adaptive LLM Inference via Token-level Dynamic Width Pruning

Haozhe Hu, Hao Wu, Peiran Yin, Chao Han 2026-07-31

Attention Inference LLM Training Hardware

WIDE introduces the first end-to-end differentiable token-level dynamic width pruning framework for LLMs, enabling fine-grained computation allocation by letting each token select attention-head and FFN-channel groups. The method uses a two-stage training pipeline and a pruning–kernel co-design that decomposes acceleration into mask reordering and block-level skipping for practical execution. At 50% sparsity, WIDE achieves a 55.1% performance boost over state-of-the-art dynamic depth pruning in calibration-only settings, with kernel-level speedups up to 1.98x for prefill and 4.95x for decoding, and end-to-end accelerations of 1.68x and 1.55x. This matters because it makes fine-grained dynamic pruning hardware-efficient, closing the gap between accuracy retention and real-world inference speedups for LLMs.
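
One way to picture why mask reordering helps block-level skipping is the toy sketch below. It is an illustration under assumptions, not WIDE's kernel: the mask layout (one 0/1 flag per token per channel group), the tile definition, and the names `active_tiles` and `reorder_tokens` are all hypothetical. A tile (token block x channel group) can be skipped only if no token in the block selected that group, so grouping tokens with similar masks leaves fewer tiles to compute.

```python
def active_tiles(masks, block_size):
    """Count (token-block, group) tiles that must be computed.

    masks[t][g] is 1 if token t selected channel group g (assumed layout).
    A tile is skippable only when every token in the block has 0 for g.
    """
    count = 0
    num_groups = len(masks[0])
    for start in range(0, len(masks), block_size):
        block = masks[start:start + block_size]
        for g in range(num_groups):
            if any(row[g] for row in block):
                count += 1
    return count


def reorder_tokens(masks):
    """Sort tokens by mask so tokens with identical masks become adjacent.

    Returns the permutation (needed to restore the original order later)
    and the reordered masks.
    """
    order = sorted(range(len(masks)), key=lambda t: masks[t])
    return order, [masks[t] for t in order]


# Four tokens, two channel groups, alternating selections.
masks = [(1, 0), (0, 1), (1, 0), (0, 1)]
order, sorted_masks = reorder_tokens(masks)
print(active_tiles(masks, block_size=2))         # tiles before reordering
print(active_tiles(sorted_masks, block_size=2))  # tiles after reordering
```

In this toy case every tile is active before reordering, while half of them become skippable once tokens with the same mask sit in the same block.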

GyRot: Leveraging Hidden Synergy between Rotation and Fine-grained Group Quantization for Low-bit LLM Inference

Sangjin Kim, Yuseon Choi, Byeongcheol Kim, Jungjun Oh 2026-07-31

Inference LLM Quantization Accelerator Hardware Low Bit Tensor

Problem: Low-bit quantization for LLM inference struggles when combining rotation and fine-grained group quantization due to a mismatch between global rotation and localized group scaling, causing accuracy loss or hardware overhead. Method: GyRot proposes an algorithm-hardware co-design framework with Coarse Rotation, Fine Grouping (CoRFiG) and Harmonic-Aligned Permutation (HAP) to integrate rotation and group quantization, plus a zero-point rounding strategy for fully integer dequantization. Finding: On an INT4-based tensor PE architecture, GyRot achieves state-of-the-art 4-bit accuracy across LLaMA-family models, delivering up to 3.4x speedup and 3.6x energy efficiency over baseline LLM accelerators. Why it matters: This validates GyRot's practical effectiveness for scalable and energy-efficient LLM deployment, addressing a critical bottleneck in low-bit inference hardware.
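
For readers unfamiliar with fine-grained group quantization, here is a minimal sketch of generic asymmetric 4-bit quantization with one scale and one integer zero-point per group. It is not GyRot's CoRFiG or HAP scheme and it omits rotation entirely; the function names, the group sizes, and the example weights are assumptions chosen for illustration. Because the zero-point is rounded to an integer, the subtraction `code - zero_point` stays in integer arithmetic and only the final multiply by the scale is floating point.

```python
def quantize_group(values, bits=4):
    """Asymmetric uniform quantization of one group (generic scheme)."""
    qmax = (1 << bits) - 1
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0   # fall back to 1.0 for an all-zero group
    zero_point = round(-lo / scale)   # integer zero-point
    codes = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Integer subtraction first, then a single scale multiply."""
    return [(c - zero_point) * scale for c in codes]


def max_error(values, group_size, bits=4):
    """Worst-case reconstruction error when quantizing in groups."""
    worst = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        restored = dequantize_group(*quantize_group(group, bits))
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


# Small weights next to large ones: a shared scale crushes the small values.
weights = [0.0, 1.5, 3.0, -1.5, 100.0, 50.0, 0.0, -50.0]
print(max_error(weights, group_size=8))  # one coarse group
print(max_error(weights, group_size=4))  # two fine groups
```

With a single group the large values dictate the scale and the small weights collapse to zero; with two groups each range gets its own scale, which is the basic motivation for fine-grained grouping.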

A Photonic-CXL Memory Appliance for Scalable KV Cache Management in LLM Inference

Jing Ding, Yash Nishant, Chandrish Ambati, Jyothsna Kamati 2026-07-31

Cache LLM Inference GPU Architecture Simulation Workload

The paper addresses the memory wall in LLM inference, where KV cache demands of tens of terabytes at hundreds of GB/s exceed the capabilities of current memory tiers. The proposed Marvell Photonic Fabric Memory Appliance replaces electrical switches with a passive fiber shuffle in a switch-free full-crossbar topology, delivering 32 TB of shared memory across 16 hosts via a photonic-CXL hybrid architecture. Emulation results show over 50% latency reduction versus electrical CXL pools, while simulation demonstrates a 6.6x improvement in time-to-first-token by eliminating cache eviction cliffs for multi-turn workloads. This work matters because it enables practical TB-scale shared memory for concurrent long-context users, overcoming the scalability limits of electrical CXL pooling in real deployments.

Investigating reservoir computing for branch prediction in pipelined processors using emerging CMOS memristor devices

Harvey Samuel George Johnson, Sendy Phang 2026-07-31

Branch Prediction Physics App Ph Microarchitecture Simulation

Problem: The paper investigates reservoir computing (RC) as a novel approach for branch prediction in pipelined processors, targeting high-speed operation and integration with CMOS digital logic using emerging memristor devices. Method: A memristor-based RC design framework was developed and implemented in simulation using SystemVerilog and Verilog-AMS, then verified with a sequence detection task and benchmarked for branch prediction on the RISC-V RV64GC ISA using the Dhrystone benchmark. Finding: Testing shows RC achieves impressive overall prediction accuracy, but it is 15x slower than the state-of-the-art TAGE predictor to adapt to changes in branching behavior, indicating shortfalls in adaptability. Why it matters: This work demonstrates RC's promise for branch prediction while highlighting the need for further refinement to compete with existing predictors, potentially enabling faster and more efficient processor designs.

InferScale: GPU-Native KV Injection for Personalized LLM Serving

Peter Li, Prashant Pandey 2026-07-31

Memory Microarchitecture Simulation GPU Cache LLM Serving

InferScale is a GPU-native LLM memory system that replaces repeated prompt prefilling with reusable KV state, addressing the TTFT increase caused by injecting persistent personalized context into prompts. It precomputes KV representations for memory facts, stores them with semantic embeddings on the GPU, and injects them directly into vLLM's paged cache, using Chunked RoPE for dynamic memory assembly and Context-Window Encoding to mitigate the loss of cross-fact context. On LoCoMo with three open-weight models, InferScale keeps TTFT nearly constant as retrieval budget grows, reducing TTFT by 72-79% (3.6-4.8x) at k=50, achieving 60.3% accuracy versus 63.3% for Mem0, and delivering 3.7-4.5x throughput under concurrent load. This decouples memory-conditioned serving latency from retrieved-context size while preserving application quality, enabling efficient personalized LLM serving without engine modifications or fine-tuning.
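
Reusing precomputed KV state at a new position is possible in principle because rotary position embeddings compose: rotating a key for position p and then by an extra offset d gives the same result as rotating it for position p + d. The sketch below demonstrates only that standard RoPE property. Whether InferScale's Chunked RoPE works this way is an assumption, and `rope` with its adjacent-pair layout and `base` default is an illustrative helper, not an API from vLLM or the paper.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to a vector of even length.

    Adjacent pairs (vec[i], vec[i+1]) are rotated by pos * base**(-i/d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


key = [1.0, 0.0, 0.5, -0.5]

# Key cached as if it sat at position 3, later placed 5 slots further on.
shifted = rope(rope(key, 3), 5)
direct = rope(key, 8)
print(max(abs(a - b) for a, b in zip(shifted, direct)))  # tiny float error
```

The practical consequence is that a cached key does not need to be recomputed from the raw token when its position in the assembled context changes; a single extra rotation is enough.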

Demystifying DRAM Read Disturbance: Bridging the Gap Between Experimental Characterization and Device-Level Modeling of RowHammer and RowPress Phenomena

Haocong Luo, Longda Zhou, Ataberk Olgun, İsmail Emir Yüksel 2026-07-31

Simulation Design

The paper addresses the gap between empirical DRAM read disturbance studies (RowHammer/RowPress) and device-level physical models, which fail to explain key observed bitflip behaviors. The authors systematically compare three fundamental metrics—bitflip directions, bitflip counts, and ACmin—against existing models, then run extensive TCAD simulations to reproduce real-chip phenomena. The simulations reveal updated error mechanisms and identify critical modeling parameters that determine whether results match hardware characterization, though the abstract does not disclose quantitative experimental results. This work provides a principled foundation for developing more accurate characterization methodologies and effective mitigations, directly impacting the security and reliability of DRAM-based systems.

Queue-Theoretic Admission Control for Multi-Tenant GPU Clusters

Sohan Kunkerkar 2026-07-31

GPU Workload Network

Problem: GPU cluster operators cannot predict workload admission wait times, and existing greedy heuristics lack formal guarantees. Method: The author formalizes admission as a multi-class, multi-resource queueing network, proves a structural decomposition into quotable and unfeasible workloads, and models quotable queues as M/G/k systems whose effective server count comes from vector packing. Finding: Optimal admission ordering is proved NP-hard via vector bin packing, and validation on Kueue shows that the vector k_eff identifies bottleneck resources, Little's Law holds exactly, and Erlang-C errs on the conservative side by overestimating wait times. Why it matters: This provides the first formal wait-time bounds for multi-tenant GPU clusters, enabling predictable admission control despite NP-hard optimality.
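
To make the queueing terms concrete, here is a minimal sketch of the textbook Erlang-C formula for an M/M/k queue together with Little's Law. It is not the paper's model: the summary describes M/G/k queues with an effective server count derived from vector packing, which is not reproduced here. The function names and the example rates are illustrative assumptions.

```python
import math


def erlang_c(k, offered_load):
    """Probability that an arriving job has to wait in an M/M/k queue.

    offered_load is a = arrival_rate / service_rate (in Erlangs).
    """
    if offered_load >= k:
        raise ValueError("unstable queue: offered load must be below k")
    # Sum of a^n / n! for n = 0 .. k-1
    head = sum(offered_load ** n / math.factorial(n) for n in range(k))
    # a^k / k! scaled by k / (k - a)
    tail = offered_load ** k / math.factorial(k) * k / (k - offered_load)
    return tail / (head + tail)


def mean_wait(k, arrival_rate, service_rate):
    """Mean time in queue W_q = C(k, a) / (k * mu - lambda)."""
    a = arrival_rate / service_rate
    return erlang_c(k, a) / (k * service_rate - arrival_rate)


def mean_queue_length(k, arrival_rate, service_rate):
    """Little's Law: L_q = lambda * W_q."""
    return arrival_rate * mean_wait(k, arrival_rate, service_rate)


# Example: 2 servers, 1 job arriving per unit time, each job takes 1 unit.
print(erlang_c(2, 1.0))
print(mean_wait(2, 1.0, 1.0))
print(mean_queue_length(2, 1.0, 1.0))
```

For k = 1 the formula reduces to the familiar M/M/1 result, where the probability of waiting equals the utilization.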

Extended Depth-First Representations of $k^2$-trees

Gabriel Carmona, Paolo Ferragina, Giovanni Manzini, Francesco Tosoni 2026-07-31

Cache Workload Data

This paper addresses the problem of poor cache performance and memory locality in traditional level-wise $k^2$-tree layouts for static graph compression. The authors propose four depth-first representations—EDF-1, BP, CEDF, and CBP—along with a linear-time compression method using suffix and LCP arrays to merge identical subtrees. Experiments on Web Graphs, Wikidata, and random adjacency matrices show that CEDF achieves the best compression in most cases, EDF-1 and CEDF consistently reduce peak memory usage, and performance varies by workload across matrix-vector and matrix-matrix operations. These findings establish depth-first $k^2$-tree layouts as a practical, efficient alternative to traditional layouts, improving both compression and computational performance in linear-algebra operations.
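
As a rough illustration of what a depth-first layout means for a $k^2$-tree, the sketch below emits the node bits of a binary matrix in preorder for k = 2. It is a generic construction under assumptions, not the paper's EDF-1, BP, CEDF, or CBP encodings: the function names are hypothetical, the matrix side must be a power of two, and no subtree merging is performed. A level-wise layout would instead emit all nodes of one level before any node of the next.

```python
def has_one(matrix, row, col, size):
    """True if the size x size submatrix at (row, col) contains a 1."""
    return any(matrix[i][j]
               for i in range(row, row + size)
               for j in range(col, col + size))


def dfs_bits(matrix, row=0, col=0, size=None):
    """Preorder bit sequence of a k^2-tree with k = 2.

    Each node contributes 4 bits, one per quadrant (top-left, top-right,
    bottom-left, bottom-right); a bit is 1 if that quadrant holds any 1.
    Non-empty quadrants larger than one cell are expanded immediately,
    so a subtree's bits are stored contiguously.
    """
    if size is None:
        size = len(matrix)
    half = size // 2
    quads = [(row, col), (row, col + half),
             (row + half, col), (row + half, col + half)]
    flags = [has_one(matrix, r, c, half) for r, c in quads]
    bits = [int(f) for f in flags]
    if half > 1:
        for (r, c), flag in zip(quads, flags):
            if flag:
                bits.extend(dfs_bits(matrix, r, c, half))
    return bits


adjacency = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
print(dfs_bits(adjacency))
```

Keeping each subtree contiguous is what gives a depth-first layout its locality: a traversal that descends into one quadrant reads neighbouring bits instead of jumping between levels.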
