Daily | Yixun Hong

NELSSA: A GPU-PNM Heterogeneous System for Mixed-Length LLM Serving via Length-based Request Placement

Sookyung Choi, Seungyong Lee, Kangkyu Park, Yunseo Chun 2026-07-30

Attention LLM Accelerator GPU CXL RDMA Runtime × Serving

GPU-centric LLM serving systems lose throughput and suffer high tail latency on highly heterogeneous mixed-length workloads. NELSSA is a GPU-PNM heterogeneous system that places requests by length: short-context requests run on GPUs, long-context requests run on PNM, and runtime migration handles contexts that grow during serving. Across mixed-length LLM workloads, NELSSA improves decode throughput (tokens/sec) by up to 5.5× and reduces P99 latency by up to 15× compared with GPU-only baselines. The results suggest that integrated GPU-PNM serving, enabled by CXL-based disaggregation, is a promising paradigm for scalable and flexible LLM infrastructure that can keep up with evolving workloads.
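The length-based placement idea can be illustrated with a minimal Python sketch. This is not NELSSA's implementation: the threshold value, the `Request` fields, the function names, and the assumption that migration goes from GPU to PNM once a context crosses the threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the summary does not state NELSSA's actual threshold.
LENGTH_THRESHOLD_TOKENS = 4096


@dataclass
class Request:
    request_id: str
    context_len: int            # prompt tokens plus tokens generated so far
    device: str = "unassigned"  # "gpu" or "pnm" once placed


def place(req: Request, threshold: int = LENGTH_THRESHOLD_TOKENS) -> str:
    """Initial placement: short contexts go to the GPU, long ones to PNM."""
    req.device = "gpu" if req.context_len < threshold else "pnm"
    return req.device


def decode_step(req: Request, threshold: int = LENGTH_THRESHOLD_TOKENS) -> bool:
    """Account for one generated token.

    Returns True if the request was migrated in this step. The GPU-to-PNM
    direction is an assumption of this sketch.
    """
    req.context_len += 1
    if req.device == "gpu" and req.context_len >= threshold:
        req.device = "pnm"
        return True
    return False
```

A request that starts just under the threshold is placed on the GPU and moves to PNM on the decode step at which its context reaches the threshold; a request that starts above the threshold is placed on PNM directly.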


The Fabric Is the Cluster Driver: Cross-Layer eBPF Policies for GPU-CXL Fabrics

Yiwei Yang, Andi Quinn 2026-07-30

Attention LLM Compiler GPU Hardware CXL Runtime × Tensor

fabric_ext is a cross-layer eBPF middleware compiler and runtime for enforcing extensible OS policies across GPU-CXL fabrics. Its core abstraction is a semantic movement graph that describes data-movement characteristics; the graph is compiled into per-device eBPF programs, BPF maps, and runtime artifacts for bpftime and dputime. An LLM prefill stress case shows that fabric_ext can manage attention and FFN dataflows that cross GPU tensor execution, DPU/NIC event execution, and CXL switch-local islands. This matters because it enables unified, data-driven policy enforcement across heterogeneous fabric components, improving observability and control in disaggregated memory systems.


At-the-Roofline Sparse Tensor Contractions on Vector Processors for Transformer Inference

Bowen Wang, Chi Zhang, Diyou Shen, Renzo Andri 2026-07-30

ISA Extension Roofline Codesign Compiler Runtime × HPC Sparsity

Vector processors cannot efficiently exploit fine-grained sparsity in Transformer inference: existing RVV architectures lack native support for Gustavson's dataflow, so they fall back on software index decoding and L1-backed indexed memory operations, which keep sparse tensor contractions far below the roofline bound. Ventaglio is a runtime-configurable sparse execution unit with RVV ISA extensions that adds indexed gather-accumulate-scatter support to drive sparse tensor contractions toward roofline performance. In a 12nm FinFET implementation, Ventaglio accelerates sparse tensor contraction kernels by 6.9–7.4× over optimized RVV baselines with only 3.1% area overhead. On a DuoGPT-pruned LLaMA-3-8B model with 40–60% dual sparsity, it achieves speedups over dense baselines of 2.40–5.25× during prefill and 2.06–3.16× during autoregressive decoding. This matters because it demonstrates that hardware-software co-design can close the roofline gap for sparse tensor contractions, enabling practical speedups for Transformer inference on vector processors without prohibitive area cost.
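For readers unfamiliar with Gustavson's dataflow, the following Python sketch shows the row-wise sparse-times-sparse product it refers to. It is a plain software illustration of the gather, accumulate, and scatter pattern, not Ventaglio's hardware or its ISA extensions; the function name and the CSR argument names are assumptions chosen for this example.

```python
def gustavson_spgemm(a_indptr, a_indices, a_data, b_indptr, b_indices, b_data):
    """Compute C = A @ B with A and B in CSR form, using Gustavson's row-wise dataflow.

    Returns C as a list with one {column: value} dict per row of A.
    """
    n_rows = len(a_indptr) - 1
    c_rows = []
    for i in range(n_rows):
        acc = {}  # sparse accumulator for row i of C
        # Walk the nonzeros a_ik of row i of A.
        for p in range(a_indptr[i], a_indptr[i + 1]):
            k = a_indices[p]
            a_ik = a_data[p]
            # Gather row k of B, scale by a_ik, and scatter-accumulate into row i of C.
            for q in range(b_indptr[k], b_indptr[k + 1]):
                j = b_indices[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_data[q]
        c_rows.append(acc)
    return c_rows
```

The inner loop is where the index-driven memory traffic sits: every nonzero of A triggers an indexed read of a row of B and indexed updates to the accumulator, which is the pattern the summary says is costly when handled by software index decoding.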


Beyond Prefill-Decode Disaggregation: Dissecting LLM Inference for Heterogeneous Platforms via Dynamic Operator Scheduling

Jiaqi Yang, Jiayi Li, Yihan Fu, Hongxiao Zhao 2026-07-30

Hardware Roofline Codesign Compiler Runtime × HPC LLM Inference

Prefill-decode (PD) disaggregation and roofline-based operator placement are insufficient for partitioning LLM inference across heterogeneous systems because of workload shape, device contention, and weight layout constraints. DOPS is a hardware-aware, closed-loop framework that combines a stage-aware DAG, a Bifocal scheduler for dynamic operator placement, and a Weight Layout Arbiter (WLA) that selects efficient weight layouts. On heterogeneous platforms combining NPUs and PIM devices, Bifocal achieves geometric-mean speedups of 1.20× to 2.23× over the PD baseline, and WLA adds a further 1.28× to 1.33× speedup. This matters because DOPS enables systematic analysis of workload sensitivity and hardware scalability, improving LLM serving efficiency on diverse heterogeneous hardware.
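As background, the roofline-based placement that the summary describes as insufficient can be sketched in a few lines of Python. This is the static baseline idea, not the DOPS or Bifocal scheduler; the function names, the device table layout, and all device numbers used with it are hypothetical.

```python
def attainable_flops(arith_intensity, peak_flops, mem_bandwidth):
    """Roofline model: performance is capped by compute or by memory traffic.

    arith_intensity is in FLOPs per byte, peak_flops in FLOP/s,
    mem_bandwidth in bytes/s.
    """
    return min(peak_flops, arith_intensity * mem_bandwidth)


def roofline_place(arith_intensity, devices):
    """Pick the device with the highest roofline-attainable performance.

    devices maps a device name to a (peak_flops, mem_bandwidth) tuple.
    """
    return max(
        devices,
        key=lambda name: attainable_flops(arith_intensity, *devices[name]),
    )
```

With a hypothetical table such as `{"npu": (100e12, 1e12), "pim": (10e12, 8e12)}`, a low-intensity operator at 1 FLOP/byte lands on the PIM device and a high-intensity operator at 200 FLOPs/byte lands on the NPU. The rule looks only at arithmetic intensity, so it cannot account for the workload shape, device contention, or weight layout constraints named in the summary.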
