Moebius: Serving Mixture-of-Expert Models with Seamless Runtime Parallelism Switch

Shaoyu Wang, Yizhuo Liang, Jaeyong Song, Chong Li 2026-06-28

Serving Mixture-of-Expert (MoE) models requires choosing between tensor parallelism (TP) and expert parallelism (EP), but the better choice depends on concurrency, which varies in production workloads. Moebius introduces a runtime parallelism switch that transitions between EP and TP without restarting the engine or dropping in-flight requests; it moves only the slices of expert weights and KV cache whose owner changes, using fused GPU-to-GPU transfer kernels. On 8x H200 GPUs serving Qwen3-235B-A22B, Moebius matches the better static parallelism at every operating point, achieves a 1.16-1.25x speedup on RL rollouts, and completes each switch in 215-434 ms with only 2.4% memory overhead. This matters because it removes the performance penalty of pinning a single parallelism layout, enabling efficient serving under the bursty and decaying concurrency patterns of production and reinforcement-learning workloads.
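To make "owner-changed slices" concrete, the following is a minimal sketch under a toy layout that the summary does not specify. It assumes that under EP each GPU holds a contiguous group of whole experts, and that under TP every expert is cut into one slice per GPU with slice `s` living on GPU `s`. The function names and the layout are illustrative assumptions, not Moebius's actual data structures.

```python
def ep_owner(expert, num_experts, num_gpus):
    """GPU that holds the whole expert under the assumed EP layout."""
    if num_experts % num_gpus != 0:
        raise ValueError("toy layout assumes num_experts divisible by num_gpus")
    per_gpu = num_experts // num_gpus
    return expert // per_gpu


def owner_changed_slices(num_experts, num_gpus):
    """List (expert, slice, src_gpu, dst_gpu) for an EP -> TP switch.

    Under the assumed TP layout, slice s of every expert belongs on GPU s.
    A slice has to move only if its EP owner differs from its TP owner.
    """
    moves = []
    for expert in range(num_experts):
        src = ep_owner(expert, num_experts, num_gpus)
        for slice_idx in range(num_gpus):
            dst = slice_idx  # TP owner of this slice (assumption)
            if src != dst:
                moves.append((expert, slice_idx, src, dst))
    return moves


if __name__ == "__main__":
    moves = owner_changed_slices(num_experts=8, num_gpus=4)
    total = 8 * 4
    print(f"{len(moves)} of {total} slices change owner")
```

In this toy layout, one slice of each expert already sits on its TP owner and stays put, so only the remaining slices need to be transferred. How Moebius actually partitions weights and KV cache may differ.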

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

Jiading Gai, Shuai Zhang, Kaj Bostrom, Jin Huang 2026-06-28

KernelPro addresses the challenge of automated GPU kernel optimization with a closed-loop multi-agent system that integrates LLM code generation with hardware profiler feedback and pluggable micro-profiling tools. The method employs a two-stage tool invocation architecture with roofline-based bottleneck classification, domain-adapted MCTS search, and direct CuTe source-level code generation from the CUTLASS/CuTe codebase. On KernelBench, KernelPro achieves geometric mean speedups of 2.42x, 4.69x, and 5.30x on Levels 1, 2, and 3, respectively, and a 1.23x improvement over hand-tuned Triton on VeOmni's MoE kernels; ablation studies confirm significant contributions from each design component. This matters because KernelPro is the first CUDA kernel coding agent to optimize energy efficiency as well as speed: it achieves an 11.6% measured energy reduction at matched speed while establishing state-of-the-art performance across all three difficulty levels.
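The roofline-based bottleneck classification mentioned above can be illustrated with the textbook roofline model. This is a minimal sketch, not KernelPro's classifier: the function names are hypothetical, and the peak throughput and bandwidth figures are placeholder values rather than measurements of any specific GPU.

```python
def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOP/byte) at which the two roofs meet."""
    return peak_flops / peak_bandwidth


def classify_bottleneck(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Label a kernel memory-bound or compute-bound via the roofline model."""
    intensity = flops / bytes_moved  # FLOP per byte of memory traffic
    if intensity < ridge_point(peak_flops, peak_bandwidth):
        return "memory-bound"
    return "compute-bound"


def attainable_flops(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Upper bound on achieved FLOP/s for this kernel under the model."""
    intensity = flops / bytes_moved
    return min(peak_flops, intensity * peak_bandwidth)


if __name__ == "__main__":
    # Placeholder hardware: 100 TFLOP/s compute, 2 TB/s memory bandwidth.
    PEAK_FLOPS = 100e12
    PEAK_BW = 2e12
    # Element-wise float32 add: 1 FLOP per 12 bytes (two loads, one store).
    print(classify_bottleneck(1, 12, PEAK_FLOPS, PEAK_BW))
```

A classification like this tells an optimizer which kind of change is worth trying: reducing memory traffic for a memory-bound kernel, or improving instruction throughput for a compute-bound one.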

TileMaxSim: IO-Aware GPU MaxSim Scoring with Dimension Tiling and Fused Product Quantization

Ashutosh Sharma 2026-06-28

Existing GPU implementations of MaxSim scoring for multi-vector retrieval models achieve only 5-18% of peak HBM bandwidth because they materialize the full similarity matrix. TileMaxSim introduces IO-aware Triton kernels with multi-query SRAM tiling, dimension tiling for embeddings exceeding 128 dimensions, and fused product-quantization scoring via shared-memory lookup tables. On NVIDIA H100 GPUs, TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second, achieving a 220x speedup over loop-based scoring and cutting ColBERTv2/PLAID scoring latency from 268 ms to 1.2 ms. This matters because it provides a drop-in replacement that preserves exact retrieval quality while sharply reducing end-to-end latency and improving GPU utilization for state-of-the-art multi-vector retrieval models.
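For readers unfamiliar with MaxSim, the score of a document is the sum, over query tokens, of the maximum dot product between that query token and any document token. The sketch below contrasts a version that builds the full similarity matrix with one that keeps only a running maximum per query token while streaming document tokens in tiles. It is plain Python for illustration only: it does not model SRAM, Triton, or product quantization, and the function names and tile size are assumptions rather than TileMaxSim's API.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_full(query, doc):
    """MaxSim that materializes the full (query x doc) similarity matrix."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query, doc, tile=2):
    """MaxSim that streams document tokens in tiles.

    Only one running maximum per query token is kept, so the full
    similarity matrix is never stored.
    """
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            for d in block:
                score = dot(q, d)
                if score > best[i]:
                    best[i] = score
    return sum(best)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
    print(maxsim_full(query, doc), maxsim_tiled(query, doc))
```

Both functions return the same score. The tiled version only shows why the intermediate matrix is unnecessary, which is the property an IO-aware kernel exploits.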
