Daily | Yixun Hong

Filtered by: Training × Language × Clear all

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao 2026-06-14

Language × Model LLM Training × GPU Hardware

ForeMoE addresses expert load imbalance in Mixture-of-Experts (MoE) models during reinforcement learning (RL) post-training, where existing step-level statistics fail due to high-frequency micro-step fluctuations. The method exploits foreseeable routing information from the rollout stage to proactively guide load balancing, using a hierarchical planner to decompose the NP-hard problem and a transfer engine for overlapped expert transfer. Evaluations on 64 GPUs show up to a 1.45× speedup over state-of-the-art RL post-training systems. This matters because it enables efficient scaling of MoE LLMs under the unique workload dynamics of RL post-training, a dominant paradigm in current LLM development.

PDF

Attention by Synchronization in Coupled Oscillator Networks

Fabio Pasqualetti, Taosha Guo 2026-06-14

Attention Language × Model Training × Transformer Hardware Network

The problem is that softmax attention requires exponentiation and global reduction, which are energy-expensive on von Neumann hardware and lack a natural physical analog. The method replaces softmax with Kuramoto synchronization dynamics, where queries are fixed anchors on a sphere and free oscillators equilibrate to encode attention weights via cosine similarity. Experimental evidence shows that at oscillator dimension 2, oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and subject-verb agreement (+5.27 pp), while on causal language modeling it closes the perplexity gap as dimension increases, from +11.09 to +2.98 on WikiText-2 and from +2.39 to +0.57 on TinyStories. This matters because it provides a mathematically grounded blueprint for accurate attention on energy-constrained physical substrates without requiring exponentiation or global reduction.

PDF