ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

ITME addresses the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond individual server capacity. The method leverages CXL-hybrid memory to provide massive, byte-addressable remote memory expansion, simplifying the software stack by eliminating complex software-level optimization. Experimental evidence from production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, along with an FPGA prototype, shows up to a 35.7% throughput improvement over conventional CPU-offloading. This matters because ITME enables cost-efficient scaling of shared context layers for agentic and long-context LLMs by proactively managing data movement across the memory-storage hierarchy.

PDF

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang 2026-06-14

The problem is that existing Diffusion Transformer (DiT) serving systems use static parallelism for each request, which is inefficient due to heterogeneity across requests, execution stages, and system conditions. GF-DiT introduces a policy-programmable runtime that dynamically adapts parallelism via an asynchronous execution abstraction and group-free collectives for low-overhead online GPU reallocation. Experimental evaluation in vLLM-Omni shows GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, and lowers SLO violation rates by up to 90% compared to fixed-pipeline execution. This matters because it enables efficient, elastic DiT serving that treats GPU parallelism as a schedulable resource, significantly improving performance and service quality for image and video generation workloads.

PDF