Daily | Yixun Hong

A Photonic-CXL Memory Appliance for Scalable KV Cache Management in LLM Inference

Jing Ding, Yash Nishant, Chandrish Ambati, Jyothsna Kamati 2026-07-31

Cache LLM Inference GPU × Architecture Simulation Workload

The paper addresses the memory wall in LLM inference, where KV cache demands of tens of terabytes at hundreds of GB/s exceed what current memory tiers can deliver. The proposed Marvell Photonic Fabric Memory Appliance replaces electrical switches with a passive fiber shuffle in a switch-free full-crossbar topology, delivering 32 TB of shared memory across 16 hosts via a photonic-CXL hybrid architecture. Emulation results show a latency reduction of over 50% versus electrical CXL pools, and simulation shows a 6.6x improvement in time-to-first-token from eliminating cache eviction cliffs in multi-turn workloads. The work matters because it makes TB-scale shared memory practical for concurrent long-context users, overcoming the scalability limits of electrical CXL pooling in real deployments.
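
To put the capacity figures in perspective, here is a back-of-the-envelope sketch of KV cache sizing. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) and the 128K-token context are illustrative assumptions, not numbers from the paper; only the 32 TB pool size comes from the summary above.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache per token: one key and one value vector
    for every layer and every KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def contexts_that_fit(pool_bytes: int, tokens_per_context: int,
                      bytes_per_token: int) -> int:
    """Number of full-length contexts whose KV cache fits in the pool."""
    return pool_bytes // (tokens_per_context * bytes_per_token)


# Assumed (hypothetical) model shape: 80 layers, 8 KV heads, head_dim 128, fp16.
PER_TOKEN = kv_bytes_per_token(80, 8, 128, 2)

POOL_BYTES = 32 * 10**12      # 32 TB pool from the summary, read as decimal TB
CONTEXT_TOKENS = 128 * 1024   # assumed context length per user

print(PER_TOKEN)                                                 # 327680
print(contexts_that_fit(POOL_BYTES, CONTEXT_TOKENS, PER_TOKEN))  # 745
```

Under these assumptions each 128K-token context needs about 40 GiB of KV cache, so a few hundred concurrent long-context users already reach the tens-of-terabytes range the paper targets.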

InferScale: GPU-Native KV Injection for Personalized LLM Serving

Peter Li, Prashant Pandey 2026-07-31

Memory Microarchitecture Simulation GPU × Cache LLM Serving

InferScale is a GPU-native LLM memory system that replaces repeated prompt prefilling with reusable KV state, addressing the increase in time-to-first-token (TTFT) caused by injecting persistent personalized context into prompts. It precomputes KV representations for memory facts, stores them with semantic embeddings on the GPU, and injects them directly into vLLM's paged cache. Chunked RoPE handles dynamic memory assembly, and Context-Window Encoding mitigates the loss of cross-fact context. On LoCoMo with three open-weight models, InferScale keeps TTFT nearly constant as the retrieval budget grows: at k=50 it reduces TTFT by 72-79% (3.6-4.8x) and delivers 3.7-4.5x throughput under concurrent load, while reaching 60.3% accuracy against 63.3% for Mem0. This decouples memory-conditioned serving latency from retrieved-context size while largely preserving application quality, enabling efficient personalized LLM serving without engine modifications or fine-tuning.
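
The percentage and multiplicative TTFT figures above are two views of the same measurement: a fractional latency reduction r corresponds to a speedup of 1 / (1 - r). A quick check using only the 72% and 79% endpoints from the summary (the function name is ours, chosen for illustration):

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0.72 = 72% lower)
    into the equivalent speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


for r in (0.72, 0.79):
    print(f"{r:.0%} lower TTFT -> {speedup_from_reduction(r):.1f}x faster")
```

The two results round to 3.6x and 4.8x, consistent with the range quoted in the summary.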

Queue-Theoretic Admission Control for Multi-Tenant GPU Clusters

Sohan Kunkerkar 2026-07-31

GPU × Workload Network

Problem: GPU cluster operators cannot predict how long a workload will wait for admission, and existing greedy heuristics lack formal guarantees. Method: We formalize admission as a multi-class, multi-resource queueing network, prove a structural decomposition into quotable and infeasible workloads, and model quotable queues as M/G/k systems whose effective server count is derived from vector packing. Finding: We prove that optimal admission ordering is NP-hard by reduction from vector bin packing. Validation on Kueue shows that the vector k_eff identifies bottleneck resources, that Little's Law holds exactly, and that Erlang-C conservatively overestimates wait times. Why it matters: This provides the first formal wait-time bounds for multi-tenant GPU clusters, enabling predictable admission control even though optimal ordering is NP-hard.
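
The queueing quantities named above can be sketched with textbook formulas. The code below computes the Erlang-C waiting probability and mean queueing delay for an M/M/k queue, applies the standard Allen-Cunneen style scaling as a rough M/G/k approximation, and uses Little's Law for the mean queue length. The `k_eff_per_resource` helper is only one plausible reading of "effective server count from vector packing" (how many copies of a workload fit along each resource dimension); it is our assumption, not the paper's definition, and all numbers are made up for illustration.

```python
from math import factorial


def erlang_c(k: int, lam: float, mu: float) -> float:
    """Probability that an arrival must wait in an M/M/k queue."""
    a = lam / mu          # offered load in Erlangs
    rho = a / k           # per-server utilization
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    tail = a**k / factorial(k) / (1.0 - rho)
    head = sum(a**n / factorial(n) for n in range(k))
    return tail / (head + tail)


def mean_wait(k: int, lam: float, mu: float, scv: float = 1.0) -> float:
    """Mean time in queue. scv is the squared coefficient of variation of
    service time; scv=1 is exact M/M/k, other values use the common
    (1 + scv) / 2 scaling as an approximation for M/G/k."""
    wq_mmk = erlang_c(k, lam, mu) / (k * mu - lam)
    return 0.5 * (1.0 + scv) * wq_mmk


def mean_queue_length(k: int, lam: float, mu: float, scv: float = 1.0) -> float:
    """Little's Law: L_q = lambda * W_q."""
    return lam * mean_wait(k, lam, mu, scv)


def k_eff_per_resource(capacity: dict, demand: dict) -> dict:
    """Hypothetical helper: copies of one workload that fit per resource."""
    return {r: capacity[r] // d for r, d in demand.items() if d > 0}


# Made-up quota and per-workload request, for illustration only.
capacity = {"gpu": 64, "cpu": 512, "mem_gib": 4096}
demand = {"gpu": 8, "cpu": 32, "mem_gib": 1024}
k_vec = k_eff_per_resource(capacity, demand)
bottleneck = min(k_vec, key=k_vec.get)
print(k_vec, bottleneck)
print(mean_wait(k_vec[bottleneck], lam=3.0, mu=1.0, scv=0.5))
```

With `scv` below 1 (service times less variable than exponential) the approximation gives a shorter wait than plain Erlang-C, which is one way an Erlang-C estimate can end up conservative.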
