Daily | Yixun Hong

A Photonic-CXL Memory Appliance for Scalable KV Cache Management in LLM Inference

Jing Ding, Yash Nishant, Chandrish Ambati, Jyothsna Kamati 2026-07-31

Cache × LLM Inference GPU Architecture Simulation Workload

The paper addresses the memory wall in LLM inference: KV caches demand tens of terabytes of capacity at hundreds of GB/s of bandwidth, which exceeds what current memory tiers can provide. The proposed Marvell Photonic Fabric Memory Appliance replaces electrical switches with a passive fiber shuffle in a switch-free full-crossbar topology, delivering 32 TB of shared memory across 16 hosts through a hybrid photonic-CXL architecture. Emulation results show a latency reduction of over 50% versus electrical CXL pools, and simulation shows a 6.6x improvement in time-to-first-token from eliminating cache eviction cliffs in multi-turn workloads. The work matters because it makes TB-scale shared memory practical for concurrent long-context users, overcoming the scalability limits of electrical CXL pooling in real deployments.
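The "eviction cliff" mentioned above can be illustrated with a toy model. The sketch below is not the paper's simulator; it assumes a plain LRU cache that holds one KV entry per session and a round-robin multi-turn workload, and the names `lru_hit_rate`, `capacity`, `num_sessions`, and `turns` are hypothetical. It shows how the hit rate collapses once capacity falls just below the number of active sessions.

```python
from collections import OrderedDict


def lru_hit_rate(capacity, num_sessions, turns):
    """Hit rate of an LRU cache of per-session KV entries under round-robin turns.

    Assumption: every session's KV state occupies exactly one cache slot.
    """
    cache = OrderedDict()
    hits = 0
    total = 0
    for _ in range(turns):
        for session in range(num_sessions):
            total += 1
            if session in cache:
                hits += 1
                cache.move_to_end(session)  # mark as most recently used
            else:
                cache[session] = True  # a miss means the KV state is recomputed
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used
    return hits / total


if __name__ == "__main__":
    for capacity in (8, 7):
        rate = lru_hit_rate(capacity, num_sessions=8, turns=10)
        print(f"capacity={capacity}: hit rate {rate:.2f}")
```

In this toy setting, losing a single slot of capacity takes the hit rate from 0.90 to 0.00, because each session is evicted just before its next turn. A larger shared pool avoids the cliff only to the extent that it holds the whole working set.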

InferScale: GPU-Native KV Injection for Personalized LLM Serving

Peter Li, Prashant Pandey 2026-07-31

Memory Microarchitecture Simulation GPU Cache × LLM Serving

InferScale is a GPU-native LLM memory system that replaces repeated prompt prefilling with reusable KV state, addressing the TTFT increase caused by injecting persistent personalized context into prompts. It precomputes KV representations for memory facts, stores them with semantic embeddings on the GPU, and injects them directly into vLLM's paged cache. Chunked RoPE supports dynamic memory assembly, and Context-Window Encoding mitigates the loss of cross-fact context. On LoCoMo with three open-weight models, InferScale keeps TTFT nearly constant as the retrieval budget grows: at k=50 it reduces TTFT by 72-79% (3.6-4.8x) and delivers 3.7-4.5x throughput under concurrent load, with 60.3% accuracy versus 63.3% for Mem0. This decouples memory-conditioned serving latency from retrieved-context size while largely preserving application quality, enabling efficient personalized LLM serving without engine modifications or fine-tuning.
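One general property of rotary position embeddings helps explain why precomputed keys can be reused at new positions: rotations compose, so a key encoded at position p can be moved to position p + d with one extra rotation. The sketch below demonstrates only that standard RoPE property. Whether InferScale's Chunked RoPE works this way is not stated in the summary, and the helper `rope` and the sample values are hypothetical.

```python
import cmath


def rope(pairs, position, base=10000.0):
    """Apply standard RoPE to a vector stored as complex pairs x[2i] + 1j * x[2i+1].

    Pair i is rotated by the angle position * base ** (-i / len(pairs)).
    """
    d = len(pairs)
    return [z * cmath.exp(1j * position * base ** (-i / d)) for i, z in enumerate(pairs)]


if __name__ == "__main__":
    key = [1 + 2j, -0.5 + 0.25j, 3 - 1j]  # made-up key vector with three pairs
    cached = rope(key, 5)        # key precomputed at position 5
    shifted = rope(cached, 12)   # re-rotated by an offset of 12
    direct = rope(key, 17)       # encoded directly at position 17
    print(max(abs(a - b) for a, b in zip(shifted, direct)))
```

The printed difference is at floating-point rounding level, so re-rotating a cached key matches encoding it from scratch at the new position.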

Extended Depth-First Representations of $k^2$-trees

Gabriel Carmona, Paolo Ferragina, Giovanni Manzini, Francesco Tosoni 2026-07-31

Cache × Workload Data

This paper addresses the problem of poor cache performance and memory locality in traditional level-wise $k^2$-tree layouts for static graph compression. The authors propose four depth-first representations—EDF-1, BP, CEDF, and CBP—along with a linear-time compression method using suffix and LCP arrays to merge identical subtrees. Experiments on Web Graphs, Wikidata, and random adjacency matrices show that CEDF achieves the best compression in most cases, EDF-1 and CEDF consistently reduce peak memory usage, and performance varies by workload across matrix-vector and matrix-matrix operations. These findings establish depth-first $k^2$-tree layouts as a practical, efficient alternative to traditional layouts, improving both compression and computational performance in linear-algebra operations.
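To make the layout difference concrete, the sketch below builds the bit sequence of a $k^2$-tree with k = 2 in two orders: the traditional level-wise order and a plain preorder (depth-first) order. It is not an implementation of EDF-1, BP, CEDF, or CBP, whose exact encodings the summary does not give. The function names and the 8x8 example matrix are hypothetical, and the matrix side is assumed to be a power of two.

```python
from collections import deque


def has_one(m, r, c, size):
    """True if the size x size submatrix with top-left corner (r, c) contains a 1."""
    return any(m[i][j] for i in range(r, r + size) for j in range(c, c + size))


def children(r, c, size):
    """The four quadrants of a submatrix, in row-major order."""
    half = size // 2
    return [(r, c, half), (r, c + half, half),
            (r + half, c, half), (r + half, c + half, half)]


def level_wise(m):
    """Emit node bits level by level (breadth-first)."""
    bits = []
    queue = deque([(0, 0, len(m))])
    while queue:
        r, c, size = queue.popleft()
        for cr, cc, half in children(r, c, size):
            bit = int(has_one(m, cr, cc, half))
            bits.append(bit)
            if bit and half > 1:  # only nonempty quadrants are expanded
                queue.append((cr, cc, half))
    return bits


def depth_first(m, r=0, c=0, size=None):
    """Emit a node's four bits, then the subtrees of its nonempty quadrants."""
    size = len(m) if size is None else size
    kids = children(r, c, size)
    node = [int(has_one(m, cr, cc, half)) for cr, cc, half in kids]
    bits = list(node)
    for bit, (cr, cc, half) in zip(node, kids):
        if bit and half > 1:
            bits.extend(depth_first(m, cr, cc, half))
    return bits


if __name__ == "__main__":
    m = [[0] * 8 for _ in range(8)]
    m[0][0] = m[7][7] = 1
    print("level-wise :", "".join(map(str, level_wise(m))))
    print("depth-first:", "".join(map(str, depth_first(m))))
```

Both orders store the same 20 bits for this matrix. In the depth-first order the bits of each subtree are contiguous, which is the locality property the depth-first layouts aim to exploit.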
