Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Jinghao Wang, Xiao Zhou, Xiaoyang Sun, Yihui Zhang 2026-06-14

Maestro targets the high resource consumption and scheduling inefficiency of deploying LLM-based multi-agent systems under strict GPU budgets. The method uses agent semantics to predict output length and memory usage, which enables hierarchical scheduling with dynamic model co-location, latency-aware routing, and workflow-aware prioritization. In experiments, Maestro reduces KV-reservation HBM by 67.2% and improves SLO attainment under high contention by 23.6 percentage points over EDF. This matters because it enables efficient, scalable deployment of complex multi-agent workflows in resource-constrained cloud environments.
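
For readers unfamiliar with the EDF baseline and the SLO-attainment metric mentioned above, here is a minimal sketch of generic earliest-deadline-first scheduling on a single server. It is not Maestro's scheduler and is not taken from the paper. The `Request` class, the request names, the service times, and the assumption that all requests arrive at time zero are hypothetical and chosen only for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Only the deadline takes part in ordering, so the heap pops the
    # request with the earliest deadline first.
    deadline: float
    name: str = field(compare=False)
    service_time: float = field(compare=False)


def edf_slo_attainment(requests):
    """Serve requests one at a time in earliest-deadline-first order.

    Assumes (hypothetically) a single server and that every request is
    available at time 0. Returns the fraction that finish by their deadline.
    """
    if not requests:
        return 1.0
    heap = list(requests)
    heapq.heapify(heap)
    clock = 0.0
    met = 0
    while heap:
        req = heapq.heappop(heap)
        clock += req.service_time  # request runs to completion
        if clock <= req.deadline:
            met += 1
    return met / len(requests)


# Hypothetical workload: three agent calls competing for one server.
demo = [
    Request(deadline=4.0, name="planner", service_time=2.0),
    Request(deadline=3.0, name="retriever", service_time=1.0),
    Request(deadline=5.0, name="critic", service_time=3.0),
]
attainment = edf_slo_attainment(demo)
print(f"SLO attainment under EDF: {attainment:.1%}")
```

In this toy workload the third request misses its deadline, so attainment is two out of three. A gain reported in percentage points, like the one in the summary, is an absolute difference between two such fractions.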

SupraSNN: Exploiting Synapse-Level Parallelism in Spiking Neural Network Accelerators through Co-Optimized Mapping and Scheduling

Seyed Sadra Ghavami, Mohammad Hossein Nikkhah, Mohammad Rasoul Roshanshah, Saeed Safari 2026-06-14

Deploying Spiking Neural Networks (SNNs) on hardware is limited by the difficulty of managing massive parallelism, a barrier analogous to the historical one of serial execution in processors. SupraSNN is a superscalar-inspired hardware-software co-design framework that treats synaptic events as parallelizable micro-operations, using a Multi-Cast Tree, parallel Synapse Processing Units, and a Merge Tree with a unified Neuron Unit. On a Xilinx Zynq XC7Z020 FPGA, SupraSNN achieves 149 μs inference latency and 0.025 mJ per image for MNIST (93.44% accuracy), delivering 47.6% lower latency and 5.6× better energy efficiency than prior FPGA-based SNN accelerators. This matters because it demonstrates a practical path to high synapse-level parallelism and energy efficiency for SNN deployment, and the approach extends to recurrent SNNs on the Spiking Heidelberg Dataset.

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang 2026-06-14

The problem is that existing Diffusion Transformer (DiT) serving systems use static parallelism for each request, which is inefficient due to heterogeneity across requests, execution stages, and system conditions. GF-DiT introduces a policy-programmable runtime that dynamically adapts parallelism via an asynchronous execution abstraction and group-free collectives for low-overhead online GPU reallocation. Experimental evaluation in vLLM-Omni shows GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, and lowers SLO violation rates by up to 90% compared to fixed-pipeline execution. This matters because it enables efficient, elastic DiT serving that treats GPU parallelism as a schedulable resource, significantly improving performance and service quality for image and video generation workloads.
