Daily | Yixun Hong


A Photonic-CXL Memory Appliance for Scalable KV Cache Management in LLM Inference

Jing Ding, Yash Nishant, Chandrish Ambati, Jyothsna Kamati 2026-07-31

Cache, LLM Inference, GPU, Architecture, Simulation, Workload

The paper addresses the memory wall in LLM inference: KV caches need tens of terabytes of capacity at hundreds of GB/s of bandwidth, which exceeds what current memory tiers provide. The proposed Marvell Photonic Fabric Memory Appliance replaces electrical switches with a passive fiber shuffle in a switch-free full-crossbar topology, delivering 32 TB of shared memory across 16 hosts through a hybrid photonic-CXL architecture. Emulation results show a latency reduction of more than 50% versus electrical CXL pools, and simulation shows a 6.6x improvement in time-to-first-token from eliminating cache eviction cliffs in multi-turn workloads. The work matters because it makes TB-scale shared memory practical for concurrent long-context users, overcoming the scalability limits of electrical CXL pooling in real deployments.
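To give a rough sense of how KV cache demand reaches the terabyte range, the sketch below applies the standard per-token sizing rule for transformer attention (one key and one value vector per layer per KV head). The model shape, context length, and user count are hypothetical values chosen for illustration; none of them come from the paper.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache stored per token (factor 2 = one key + one value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, concurrent_sequences: int,
                   bytes_per_elem: int = 2) -> int:
    """Total KV cache footprint for a set of concurrent sequences."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim,
                                   bytes_per_elem)
    return per_token * context_tokens * concurrent_sequences


if __name__ == "__main__":
    # Hypothetical configuration (assumption, not from the paper):
    # 80 layers, 8 KV heads, head dimension 128, fp16 values,
    # 128Ki-token contexts, 256 concurrent sequences.
    total = kv_cache_bytes(80, 8, 128, 131072, 256)
    print(f"{total / 2**40:.1f} TiB")
```

Under these assumed values the footprint is about 10 TiB, which illustrates why a shared pool on the order of the 32 TB described above is relevant for concurrent long-context users.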


Queue-Theoretic Admission Control for Multi-Tenant GPU Clusters

Sohan Kunkerkar 2026-07-31

GPU, Workload, Network

Problem: GPU cluster operators cannot predict how long a workload will wait for admission, and existing greedy heuristics lack formal guarantees. Method: We formalize admission as a multi-class, multi-resource queueing network, prove a structural decomposition into quotable and unfeasible workloads, and model quotable queues as M/G/k systems whose effective server count is derived from vector packing. Finding: We prove that optimal admission ordering is NP-hard via vector bin packing. Validation on Kueue shows that the vector k_eff identifies bottleneck resources, that Little's Law holds exactly, and that Erlang-C conservatively overestimates wait times. Why it matters: This provides the first formal wait-time bounds for multi-tenant GPU clusters, enabling predictable admission control even though optimal ordering is NP-hard.
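The Erlang-C and Little's Law quantities mentioned in this summary can be sketched as follows. This is a minimal illustration of the textbook M/M/k formulas, not the paper's model: the function names are hypothetical, the server count `k_eff` is simply passed in as an integer (how the paper derives it from vector packing is not shown here), and service times are assumed exponential, whereas the paper models M/G/k queues.

```python
import math


def erlang_c(k_eff: int, offered_load: float) -> float:
    """Probability that an arriving job must wait in an M/M/k queue.

    offered_load = arrival_rate * mean_service_time (in Erlangs).
    """
    if offered_load >= k_eff:
        return 1.0  # unstable queue: every arrival waits
    # Erlang-B via the numerically stable recurrence, then convert to Erlang-C.
    blocking = 1.0
    for n in range(1, k_eff + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    utilization = offered_load / k_eff
    return blocking / (1.0 - utilization * (1.0 - blocking))


def mean_wait(k_eff: int, arrival_rate: float, mean_service_time: float) -> float:
    """Mean time in queue (excluding service) for an M/M/k system."""
    offered_load = arrival_rate * mean_service_time
    if offered_load >= k_eff:
        return math.inf
    p_wait = erlang_c(k_eff, offered_load)
    return p_wait * mean_service_time / (k_eff - offered_load)


def mean_queue_length(k_eff: int, arrival_rate: float,
                      mean_service_time: float) -> float:
    """Little's Law: mean jobs waiting = arrival rate * mean wait."""
    return arrival_rate * mean_wait(k_eff, arrival_rate, mean_service_time)


if __name__ == "__main__":
    # Hypothetical numbers: 4 effective servers, 3 jobs/hour, 1-hour jobs.
    print(erlang_c(4, 3.0))
    print(mean_wait(4, 3.0, 1.0))
    print(mean_queue_length(4, 3.0, 1.0))
```

With a single server the formula reduces to the familiar M/M/1 result, where the probability of waiting equals the utilization, which is a convenient sanity check.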
