Cache-Resident LLM Inference in GB-Scale Last-Level Caches

Wanning Zhang, Tongzhou Gu, Marco Canini, Ceyu Xu (2026-06-28)

LLM inference is dominated by data movement across the memory hierarchy, and cache-resident execution is complicated by deeper pipelining, a larger KV-cache footprint, and synchronization bottlenecks at operator boundaries. The paper introduces a cache-resident execution model that places weight-centric operators and attention/KV-cache management in separate, dedicated resource domains and relaxes synchronization to sub-operator dependencies. The model is instantiated on a multi-socket CPU cluster with a weight-attention decoupled architecture. The prototype achieves a 2.04x-11.51x speedup in time per output token on deployed Llama models, and up to 13.9x under a validated analytical model, substantially outperforming an equally provisioned llama.cpp. These results indicate that commodity CPUs with GB-scale last-level caches can efficiently support LLM inference through cache residency, decoupled state management, and dependency-aware coordination.
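
To make the idea of relaxing synchronization to sub-operator dependencies concrete, here is a minimal toy sketch. It is not the paper's implementation. It assumes two chained operators, each split into tiles, with one worker per tile index, and it assumes that tile i of the second operator depends only on tile i of the first. The function names and the tile durations are hypothetical.

```python
# Toy model (illustrative assumption, not the paper's scheduler):
# stage_a[i] and stage_b[i] are the run times of tile i in two chained
# operators, and each tile index has its own worker.

def barrier_makespan(stage_a, stage_b):
    """Operator-boundary barrier: no tile of B starts until all of A is done."""
    return max(stage_a) + max(stage_b)

def tile_dependency_makespan(stage_a, stage_b):
    """Sub-operator dependency: tile i of B starts once tile i of A is done."""
    return max(a + b for a, b in zip(stage_a, stage_b))

# Hypothetical tile durations in arbitrary time units.
stage_a = [4, 1, 2, 3]
stage_b = [1, 4, 3, 2]

print(barrier_makespan(stage_a, stage_b))          # 8
print(tile_dependency_makespan(stage_a, stage_b))  # 5
```

In this toy model the tile-level schedule never finishes later than the barrier schedule, because each worker waits only on its own producer tile rather than on the slowest tile of the whole operator.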
