
Moebius: Serving Mixture-of-Expert Models with Seamless Runtime Parallelism Switch

Shaoyu Wang, Yizhuo Liang, Jaeyong Song, Chong Li 2026-06-28

Moebius addresses a dilemma in serving Mixture-of-Experts (MoE) models: the engine must commit to either tensor parallelism (TP) or expert parallelism (EP), yet which of the two is faster depends on concurrency, which varies in production workloads. The method introduces a runtime parallelism switch that moves between EP and TP without restarting the engine or dropping in-flight requests. It transfers only the slices of expert weights and KV cache whose owning GPU changes, using fused GPU-to-GPU transfer kernels. On 8x H200 GPUs serving Qwen3-235B-A22B, Moebius matches the better static parallelism at every operating point, achieves a 1.16-1.25x speedup on RL rollouts, and completes each switch in 215-434 ms with only 2.4% memory overhead. By removing the performance penalty of pinning a single parallelism layout, it enables efficient serving under bursty and decaying concurrency patterns in production and reinforcement-learning workloads.
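
To make the idea of moving only the slices whose owner changes concrete, here is a minimal Python sketch. It is not the paper's implementation. The layouts are assumptions chosen for illustration: under EP each GPU holds a contiguous block of whole experts, under TP each GPU holds shard i of every expert, and the function names (`ep_owner`, `tp_owner`, `owner_changed_slices`) and the toy sizes are hypothetical.

```python
# Illustrative sketch only: layouts and names are assumptions, not taken from Moebius.

def ep_owner(expert: int, num_experts: int, num_gpus: int) -> int:
    """Assumed EP layout: GPU g holds a contiguous block of whole experts."""
    experts_per_gpu = num_experts // num_gpus
    return expert // experts_per_gpu


def tp_owner(shard: int) -> int:
    """Assumed TP layout: GPU i holds shard i of every expert."""
    return shard


def owner_changed_slices(num_experts: int, num_gpus: int):
    """List (expert, shard, src_gpu, dst_gpu) for slices that must move on EP -> TP."""
    if num_experts % num_gpus != 0:
        raise ValueError("this sketch assumes num_experts is divisible by num_gpus")
    moves = []
    for expert in range(num_experts):
        src = ep_owner(expert, num_experts, num_gpus)
        for shard in range(num_gpus):
            dst = tp_owner(shard)
            if src != dst:  # slice already lives on its TP owner otherwise
                moves.append((expert, shard, src, dst))
    return moves


if __name__ == "__main__":
    num_experts, num_gpus = 16, 8  # toy sizes, not the model's real configuration
    moves = owner_changed_slices(num_experts, num_gpus)
    total = num_experts * num_gpus
    print(f"{len(moves)} of {total} slices change owner")
```

Under these assumed layouts, each expert already has exactly one shard on its future TP owner, so a fraction 1/N of the slices stays in place and the rest is transferred. The real system may partition weights and KV cache differently.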
