Daily | Yixun Hong

A Photonic-CXL Memory Appliance for Scalable KV Cache Management in LLM Inference

Jing Ding, Yash Nishant, Chandrish Ambati, Jyothsna Kamati 2026-07-31

Cache LLM Inference GPU Architecture Simulation Workload

The paper addresses the memory wall in LLM inference, where KV cache demands of tens of terabytes at hundreds of GB/s exceed what current memory tiers can deliver. The proposed Marvell Photonic Fabric Memory Appliance replaces electrical switches with a passive fiber shuffle in a switch-free full-crossbar topology, delivering 32 TB of shared memory across 16 hosts via a photonic-CXL hybrid architecture. Emulation results show over 50% latency reduction versus electrical CXL pools, while simulation demonstrates a 6.6x improvement in time-to-first-token by eliminating cache eviction cliffs for multi-turn workloads. This work matters because it enables practical TB-scale shared memory for concurrent long-context users, overcoming the scalability limits of electrical CXL pooling in real deployments.
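To see how KV cache demand can reach tens of terabytes, a rough back-of-the-envelope sketch may help. It uses the standard per-token KV cache size (one key and one value vector per layer and per KV head). The model shape, context length, and user count below are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, bytes_per_value, num_tokens):
    """KV cache size in bytes for one sequence.

    Each token stores a key and a value vector (factor of 2)
    for every layer and every KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model shape and workload (assumptions, not from the paper):
# 80 layers, 8 KV heads, head dimension 128, 2-byte (fp16) values,
# 128K-token context, 1000 concurrent users.
per_user = kv_cache_bytes(80, 8, 128, 2, 128 * 1024)
users = 1000
total = per_user * users

print(f"per user: {per_user / 2**30:.1f} GiB")      # per user: 40.0 GiB
print(f"{users} users: {total / 2**40:.1f} TiB")    # 1000 users: 39.1 TiB
```

Under these assumed numbers, a thousand long-context sessions alone would need more KV capacity than the 32 TB appliance described above, which is the kind of pressure that leads to cache eviction.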

PDF

Investigating reservoir computing for branch prediction in pipelined processors using emerging CMOS memristor devices

Harvey Samuel George Johnson, Sendy Phang 2026-07-31

Branch Prediction Physics App Ph Microarchitecture Simulation

Problem: The paper investigates reservoir computing (RC) as a novel approach to branch prediction in pipelined processors, targeting high-speed operation and integration with CMOS digital logic using emerging memristor devices. Method: A memristor-based RC design framework was developed and implemented in simulation using SystemVerilog and Verilog-AMS, then verified with a sequence detection task and benchmarked for branch prediction on the RISC-V RV64GC ISA using the Dhrystone benchmark. Finding: Testing shows RC achieves high overall prediction accuracy, but it is 15x slower than the state-of-the-art TAGE predictor at adapting to changes in branching behavior. Why it matters: This work demonstrates RC's promise for branch prediction while highlighting the need for further refinement to compete with existing predictors, potentially enabling faster and more efficient processor designs.

PDF

InferScale: GPU-Native KV Injection for Personalized LLM Serving

Peter Li, Prashant Pandey 2026-07-31

Memory Microarchitecture Simulation GPU Cache LLM Serving

InferScale is a GPU-native LLM memory system that replaces repeated prompt prefilling with reusable KV state, addressing the increase in time-to-first-token (TTFT) caused by injecting persistent personalized context into prompts. It precomputes KV representations for memory facts, stores them with semantic embeddings on the GPU, and injects them directly into vLLM's paged cache, using Chunked RoPE for dynamic memory assembly and Context-Window Encoding to mitigate the loss of cross-fact context. On LoCoMo with three open-weight models, InferScale keeps TTFT nearly constant as the retrieval budget grows: at k=50 it reduces TTFT by 72-79% (3.6-4.8x), reaches 60.3% accuracy versus 63.3% for Mem0, and delivers 3.7-4.5x throughput under concurrent load. This decouples memory-conditioned serving latency from retrieved-context size while largely preserving application quality, enabling efficient personalized LLM serving without engine modifications or fine-tuning.
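The two ways of quoting the TTFT gain (percent reduction and speedup factor) are related by speedup = 1 / (1 - reduction). A minimal sketch of that conversion, with a hypothetical helper name, shows the reported ranges are consistent with each other:

```python
def speedup_from_reduction(reduction_pct):
    """Convert a latency reduction in percent to a speedup factor.

    If latency drops by r percent, the new latency is (1 - r/100) of
    the old one, so the speedup is 1 / (1 - r/100).
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# The summary's reported range of TTFT reductions at k=50.
for pct in (72, 79):
    print(f"{pct}% reduction -> {speedup_from_reduction(pct):.2f}x")
# 72% reduction -> 3.57x
# 79% reduction -> 4.76x
```

Rounded to one decimal place, these give the 3.6x and 4.8x figures quoted in the summary.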

PDF

Demystifying DRAM Read Disturbance: Bridging the Gap Between Experimental Characterization and Device-Level Modeling of RowHammer and RowPress Phenomena

Haocong Luo, Longda Zhou, Ataberk Olgun, İsmail Emir Yüksel 2026-07-31

Simulation Design

The paper addresses the gap between empirical DRAM read disturbance studies (RowHammer/RowPress) and device-level physical models, which fail to explain key observed bitflip behaviors. The authors systematically compare three fundamental metrics (bitflip directions, bitflip counts, and ACmin, the minimum number of aggressor-row activations needed to induce a bitflip) against existing models, then run extensive TCAD simulations to reproduce real-chip phenomena. The simulations reveal updated error mechanisms and identify critical modeling parameters that determine whether results match hardware characterization, though the abstract does not disclose quantitative experimental results. This work provides a principled foundation for developing more accurate characterization methodologies and effective mitigations, directly impacting the security and reliability of DRAM-based systems.

PDF