Daily | Yixun Hong


Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao 2026-06-14

Language Model LLM × Training × GPU Hardware

ForeMoE targets expert load imbalance in Mixture-of-Experts (MoE) models during reinforcement learning (RL) post-training, where balancing on step-level statistics fails because expert load fluctuates at high frequency from one micro-step to the next. The method exploits routing information that is foreseeable from the rollout stage to guide load balancing proactively, using a hierarchical planner to decompose the NP-hard balancing problem and a transfer engine for overlapped expert transfer. Evaluations on 64 GPUs show up to a 1.45× speedup over state-of-the-art RL post-training systems. This matters because RL post-training is a dominant paradigm in current LLM development, and efficient scaling of MoE LLMs depends on handling its unique workload dynamics.
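To make the idea of planning from foreseen routing concrete, here is a minimal sketch that assumes the per-expert token counts for an upcoming micro-step are already known. It uses a plain greedy longest-processing-time heuristic, not ForeMoE's hierarchical planner, and the function name, the input format, and the absence of any per-GPU expert capacity limit are all illustrative assumptions.

```python
import heapq

def plan_expert_placement(expert_loads, num_gpus):
    """Greedy placement of experts onto GPUs (hypothetical helper).

    expert_loads: dict mapping expert id -> token count foreseen for one micro-step.
    Returns a dict mapping gpu id -> (total tokens, list of expert ids).
    """
    # Min-heap keyed on the tokens assigned so far; the gpu id breaks ties.
    heap = [(0, gpu, []) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    # Place the heaviest experts first, each on the currently lightest GPU.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, gpu, experts = heapq.heappop(heap)
        experts.append(expert)
        heapq.heappush(heap, (total + load, gpu, experts))
    return {gpu: (total, experts) for total, gpu, experts in heap}

# Illustrative loads for four experts on two GPUs.
plan = plan_expert_placement({0: 900, 1: 100, 2: 500, 3: 500}, num_gpus=2)
```

In this toy case the heavy expert 0 is paired with the light expert 1, so both GPUs end up with 1000 tokens. A real system would also have to account for the cost of moving expert weights between GPUs, which this sketch ignores.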


MiniMax Sparse Attention

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen 2026-06-14

Sparsity

Quadratic-cost softmax attention makes ultra-long-context LLM inference untenable at deployment scale. MiniMax Sparse Attention (MSA) pairs a lightweight Index Branch, which performs blockwise Top-k selection per GQA group, with a Main Branch that computes exact block-sparse attention, co-designed with an exp-free GPU kernel. On a 109B multimodal model, MSA reduces per-token attention compute by 28.4× at 1M context and achieves 14.2× prefill and 7.6× decoding speedups on H800. This matters because it enables practical deployment of frontier LLMs with million-token contexts for agentic workflows and repository-scale reasoning.
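The two-branch structure can be illustrated with a small pure-Python sketch for a single query vector. Everything here is an assumption made for illustration: block scores come from the mean key of each block, the function names are hypothetical, the softmax uses an ordinary `math.exp` rather than an exp-free kernel, and there is no GQA grouping. It is not the MSA implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    """Index step: score each key block by its mean key, keep the top_k block ids."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start // block_size))
    best = sorted(scores, reverse=True)[:top_k]
    return sorted(block_id for _, block_id in best)

def block_sparse_attention(query, keys, values, block_size, top_k):
    """Main step: exact softmax attention restricted to the selected blocks."""
    block_ids = select_blocks(query, keys, block_size, top_k)
    idx = [i for b in block_ids
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / norm
           for d in range(dim)]
    return out, block_ids

# Four blocks of two keys each; blocks 1 and 3 align best with the query.
keys = [[0.0, 1.0]] * 2 + [[5.0, 0.0]] * 2 + [[-1.0, 0.0]] * 2 + [[2.0, 0.0]] * 2
values = [[float(i)] for i in range(len(keys))]
out, chosen = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=2)
```

With `top_k=2` only four of the eight keys enter the softmax, which is where the compute saving comes from: the attention cost scales with the number of selected blocks rather than the full context length.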
