Daily | Yixun Hong


ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

Inference

ITME targets the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond the capacity of a single server. It uses CXL-hybrid memory to provide large, byte-addressable remote memory expansion, which simplifies the software stack by removing the need for complex software-level optimization. Experiments on production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, together with an FPGA prototype, show up to a 35.7% throughput improvement over conventional CPU offloading. By proactively managing data movement across the memory-storage hierarchy, ITME enables cost-efficient scaling of shared context layers for agentic and long-context LLMs.


Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Jinghao Wang, Xiao Zhou, Xiaoyang Sun, Yihui Zhang 2026-06-14

Workload, Scheduling, LLM Agent

Maestro targets the high resource consumption and scheduling inefficiency of deploying LLM-based multi-agent systems under strict GPU budgets. It uses agent semantics to predict each request's output length and memory usage, which enables hierarchical scheduling with dynamic model co-location, latency-aware routing, and workflow-aware prioritization. In the reported experiments, Maestro reduces the HBM reserved for the KV cache by 67.2% and improves SLO attainment under high contention by 23.6 percentage points over earliest-deadline-first (EDF) scheduling. This enables efficient, scalable deployment of complex multi-agent workflows in resource-constrained cloud environments.
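To make the link between output-length prediction and KV-cache reservation concrete, here is a minimal Python sketch of the arithmetic. It is not Maestro's implementation: the `ModelShape` fields, the function names, and the example numbers are all hypothetical, and it assumes the standard per-token KV-cache size (one key and one value vector per layer and per KV head). It only shows why reserving for a predicted output length, instead of the maximum, shrinks the reservation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, chosen for illustration only."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes fp16/bf16 KV entries


def kv_bytes_per_token(shape: ModelShape) -> int:
    # One key vector and one value vector per layer and per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def kv_reservation_bytes(shape: ModelShape, prompt_tokens: int,
                         output_tokens: int) -> int:
    # Memory needed to hold the KV cache for the prompt plus generated tokens.
    return kv_bytes_per_token(shape) * (prompt_tokens + output_tokens)


def reservation_saving(shape: ModelShape, prompt_tokens: int,
                       max_output_tokens: int,
                       predicted_output_tokens: int) -> float:
    """Fraction of the worst-case reservation avoided by using a prediction."""
    # Never reserve for more than the configured maximum.
    predicted = min(predicted_output_tokens, max_output_tokens)
    worst_case = kv_reservation_bytes(shape, prompt_tokens, max_output_tokens)
    informed = kv_reservation_bytes(shape, prompt_tokens, predicted)
    return 1.0 - informed / worst_case


if __name__ == "__main__":
    shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
    saving = reservation_saving(shape, prompt_tokens=1000,
                                max_output_tokens=4096,
                                predicted_output_tokens=512)
    print(f"KV bytes per token: {kv_bytes_per_token(shape)}")
    print(f"Reservation saved vs. worst case: {saving:.1%}")
```

The saving depends entirely on how far typical outputs fall below the configured maximum and on how accurate the predictor is; an under-prediction would require growing the reservation at run time, which this sketch does not model.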
