Daily | Yixun Hong

NELSSA: A GPU-PNM Heterogeneous System for Mixed-Length LLM Serving via Length-based Request Placement

Sookyung Choi, Seungyong Lee, Kangkyu Park, Yunseo Chun | 2026-07-30

Attention LLM Accelerator GPU × CXL RDMA Runtime Serving

NELSSA targets the throughput and latency inefficiencies of GPU-centric LLM serving systems under highly heterogeneous, mixed-length workloads. It introduces a heterogeneous system that pairs GPUs with processing-near-memory (PNM) devices and places requests by length: short-context requests go to GPUs, long-context requests go to PNM, and requests whose context grows are migrated at runtime. Across mixed-length LLM workloads, NELSSA improves decode throughput (tokens/sec) by up to 5.5x and reduces P99 latency by up to 15x compared to GPU-only baselines. This matters because it suggests that integrated GPU-PNM serving, enabled by CXL-based disaggregation, is a promising paradigm for scalable and flexible LLM infrastructure that can efficiently support evolving workloads.
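
The length-based placement idea can be illustrated with a minimal Python sketch. This is not NELSSA's implementation: the threshold value, the `Request` type, and the function names `place` and `on_tokens_generated` are all hypothetical, and the summary does not say how the real system picks its cutoff or performs migration.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the summary does not state the threshold NELSSA uses.
LENGTH_THRESHOLD_TOKENS = 8192


@dataclass
class Request:
    request_id: str
    context_tokens: int
    device: str = "unplaced"


def place(request: Request, threshold: int = LENGTH_THRESHOLD_TOKENS) -> str:
    """Assign a new request to the GPU or PNM pool based on context length."""
    request.device = "pnm" if request.context_tokens >= threshold else "gpu"
    return request.device


def on_tokens_generated(request: Request, new_tokens: int,
                        threshold: int = LENGTH_THRESHOLD_TOKENS) -> bool:
    """Grow the context; migrate GPU -> PNM once it crosses the threshold.

    Returns True if a migration happened on this call.
    """
    request.context_tokens += new_tokens
    if request.device == "gpu" and request.context_tokens >= threshold:
        request.device = "pnm"
        return True
    return False
```

In this sketch a request that starts short is served on the GPU pool and is moved only once, when its accumulated context reaches the cutoff; a real system would also have to move the request's KV cache, which is omitted here.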

The Fabric Is the Cluster Driver: Cross-Layer eBPF Policies for GPU-CXL Fabrics

Yiwei Yang, Andi Quinn | 2026-07-30

Attention LLM Compiler GPU × Hardware CXL Runtime Tensor

fabric_ext is a cross-layer eBPF middleware compiler and runtime for enforcing extensible OS policies across GPU-CXL fabrics. Its core abstraction is a semantic movement graph that describes data-movement characteristics; the graph is compiled into per-device eBPF programs, BPF maps, and runtime artifacts for bpftime and dputime. In an LLM prefill stress case, fabric_ext manages attention and FFN dataflows that cross GPU tensor execution, DPU/NIC event execution, and CXL switch-local islands. This matters because it enables unified, data-driven policy enforcement across heterogeneous fabric components, improving observability and control in disaggregated memory systems.

InferScale: GPU-Native KV Injection for Personalized LLM Serving

Peter Li, Prashant Pandey | 2026-07-30

Memory Microarchitecture Simulation GPU × Cache LLM Serving

InferScale addresses the rising time-to-first-token (TTFT) in personalized LLM serving caused by repeatedly prefilling persistent user memory. It precomputes KV representations for each memory fact, stores them on the GPU alongside semantic embeddings, and injects them directly into vLLM's paged cache using Chunked RoPE and Context-Window Encoding. On LoCoMo with three open-weight models, InferScale reduces TTFT by 72-79% (a 3.6-4.8x speedup) at k=50, reaches 60.3% accuracy versus 63.3% for Mem0, and delivers 3.7-4.5x throughput under concurrent load. This matters because reusable KV state decouples memory-conditioned serving latency from retrieved-context size, enabling scalable personalized LLM serving without engine modifications or fine-tuning.
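
The two TTFT figures (72-79% and 3.6-4.8x) are the same result in different units. A short sketch of the conversion, assuming the percentage is measured relative to the baseline TTFT; the helper name `speedup_from_reduction` is illustrative and not from the paper:

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a latency reduction in percent into a speedup factor.

    If latency falls by r (as a fraction), the new latency is (1 - r) times
    the baseline, so the speedup is 1 / (1 - r).
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


low = round(speedup_from_reduction(72), 1)   # 3.6
high = round(speedup_from_reduction(79), 1)  # 4.8
```

Both ends of the range match the reported 3.6-4.8x after rounding to one decimal place.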
