ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

ITME addresses the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond the capacity of an individual server. The method uses CXL-hybrid memory to provide massive, byte-addressable remote memory expansion, which simplifies the software stack by removing the need for complex software-level optimization. Experiments on production-grade SK Hynix CMM devices and PCIe Gen5 NVMe SSDs, together with an FPGA prototype, show up to 35.7% higher throughput than conventional CPU offloading. This matters because ITME enables cost-efficient scaling of shared context layers for agentic and long-context LLMs by proactively managing data movement across the memory-storage hierarchy.
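
The idea of proactive data movement across a memory-storage hierarchy can be illustrated with a minimal two-tier sketch. This is not ITME's design, which relies on CXL-hybrid memory hardware rather than software; the class `TieredStore`, its LRU demotion policy, and the `prefetch` method are hypothetical names and choices made only for illustration.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small fast tier backed by a large slow tier.

    Hypothetical illustration only; the LRU policy is an assumption and
    does not describe how ITME manages its hierarchy.
    """

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()  # ordered from least to most recently used
        self.slow = {}             # stands in for the large, slower tier
        self.fast_capacity = fast_capacity

    def put(self, key, value):
        # Keep exactly one copy of each entry: it lives in the fast tier now.
        self.slow.pop(key, None)
        self.fast[key] = value
        self.fast.move_to_end(key)
        self._demote_overflow()

    def _demote_overflow(self):
        # Demote least recently used entries until the fast tier fits.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value

    def prefetch(self, key):
        # Proactively promote an entry that is expected to be needed soon.
        if key in self.slow:
            self.put(key, self.slow[key])

    def get(self, key):
        # Return the value and the tier that served the request.
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key], "fast"
        value = self.slow[key]  # raises KeyError if the key is unknown
        self.put(key, value)
        return value, "slow"
```

In this sketch, a caller that knows which context blocks an upcoming request will touch can call `prefetch` ahead of time so that the later `get` is served from the fast tier.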

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao 2026-06-14

ForeMoE addresses expert load imbalance in Mixture-of-Experts (MoE) models during reinforcement learning (RL) post-training, where balancing based on step-level statistics fails because expert load fluctuates at high frequency from one micro-step to the next. The method exploits routing information that is foreseeable from the rollout stage to guide load balancing proactively, using a hierarchical planner to decompose the NP-hard balancing problem and a transfer engine that overlaps expert transfer with computation. Evaluations on 64 GPUs show up to a 1.45× speedup over state-of-the-art RL post-training systems. This matters because it enables efficient scaling of MoE LLMs under the unique workload dynamics of RL post-training, a dominant paradigm in current LLM development.
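
To make the balancing problem concrete, the sketch below places experts on GPUs once per-expert token counts for a micro-step are known in advance. It uses the classic greedy longest-processing-time (LPT) heuristic, which is a stand-in and not ForeMoE's hierarchical planner; the function name `balance_experts` and the input format are assumptions, and real constraints such as a fixed number of expert slots per GPU are ignored.

```python
import heapq


def balance_experts(expert_loads, num_gpus):
    """Greedy LPT placement: heaviest expert goes to the least-loaded GPU.

    expert_loads maps an expert id to its (foreseen) token count for one
    micro-step. Hypothetical helper, not the paper's planner.
    """
    # Min-heap of (current load, gpu id); ties resolve to the lower gpu id.
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)

    placement = {}
    # Visit experts from heaviest to lightest; break ties by expert id.
    ordered = sorted(expert_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for expert, load in ordered:
        total, gpu = heapq.heappop(heap)
        placement[expert] = gpu
        heapq.heappush(heap, (total + load, gpu))

    # Recompute per-GPU totals from the final placement.
    gpu_loads = [0] * num_gpus
    for expert, gpu in placement.items():
        gpu_loads[gpu] += expert_loads[expert]
    return placement, gpu_loads


if __name__ == "__main__":
    loads = {0: 900, 1: 100, 2: 500, 3: 500}
    placement, gpu_loads = balance_experts(loads, num_gpus=2)
    print(placement, gpu_loads)
```

Because the routing for upcoming micro-steps is assumed to be known, a placement like this could be recomputed for each micro-step instead of once per training step, which is the granularity the summary above highlights.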

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Jinghao Wang, Xiao Zhou, Xiaoyang Sun, Yihui Zhang 2026-06-14

Maestro addresses the high resource consumption and scheduling inefficiency of deploying LLM-based multi-agent systems under strict GPU budgets. The method uses agent semantics to predict output length and memory usage, which enables hierarchical scheduling with dynamic model co-location, latency-aware routing, and workflow-aware prioritization. Experiments show that Maestro reduces the HBM reserved for the KV cache by 67.2% and improves SLO attainment under high contention by 23.6 percentage points over earliest-deadline-first (EDF) scheduling. This matters because it enables efficient, scalable deployment of complex multi-agent workflows in resource-constrained cloud environments.
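
The link between output-length prediction and KV-cache reservation can be sketched with the standard per-token KV size formula (keys plus values, across all layers). Everything else here is an assumption for illustration: the function names, the additive safety margin `margin_tokens`, and the example model shape are hypothetical and are not taken from Maestro.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def reserved_bytes(prompt_len, output_len, per_token):
    """Memory reserved for one request of the given total length."""
    return (prompt_len + output_len) * per_token


def reservation_saving(requests, max_output_len, per_token, margin_tokens=0):
    """Fraction of KV reservation saved by sizing to predicted length.

    requests is a list of (prompt_len, predicted_output_len) pairs.
    The baseline reserves max_output_len tokens for every request.
    """
    worst_case = sum(
        reserved_bytes(prompt, max_output_len, per_token)
        for prompt, _ in requests
    )
    predicted = sum(
        reserved_bytes(
            prompt,
            # Never reserve more than the hard output cap.
            min(max_output_len, predicted_len + margin_tokens),
            per_token,
        )
        for prompt, predicted_len in requests
    )
    return 1 - predicted / worst_case


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dim 128, fp16.
    per_token = kv_bytes_per_token(32, 8, 128)
    saving = reservation_saving([(1000, 100), (1000, 500)], 2000, per_token)
    print(f"per-token KV bytes: {per_token}, saving: {saving:.1%}")
```

The saving depends entirely on how far typical outputs fall below the cap and on how accurate the predictor is, so the figure printed here says nothing about the 67.2% reported for Maestro.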
