
ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim 2026-06-14

Problem: Large reasoning models (LRMs) incur high inference costs because of their long reasoning traces. Directly applying NVFP4 low-precision quantization degrades reasoning accuracy, and existing kernels fail to deliver latency benefits in small-batch autoregressive decoding. Method: ReSET combines step-aware temperature scaling, which estimates step-level uncertainty online from both token-level and step-level entropy signals, with a CUDA-core small-M NVFP4 kernel for latency-critical decoding. Finding: ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline; the custom kernel achieves up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Why it matters: This work enables accurate and efficient low-precision inference for latency-critical LRM deployments, reducing computational and memory costs without sacrificing reasoning quality.
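To make the idea of entropy-driven temperature scaling concrete, here is a minimal sketch in Python. It is not ReSET's algorithm: the summary does not give the update rule, so the function `step_temperature`, its `threshold` and `cooled_temp` parameters, and the choice to lower the temperature when a step looks uncertain are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_temperature(token_entropies, base_temp=1.0, threshold=1.0, cooled_temp=0.7):
    """Hypothetical rule: pick a temperature for the next reasoning step.

    The step-level signal is the mean of the token-level entropies observed
    in the current step. If it exceeds `threshold`, sample the next step at
    `cooled_temp`; otherwise keep `base_temp`. Both values are illustrative.
    """
    if not token_entropies:
        return base_temp
    step_entropy = sum(token_entropies) / len(token_entropies)
    return cooled_temp if step_entropy > threshold else base_temp
```

In this sketch a lower temperature sharpens the next-token distribution, so `entropy(softmax(logits, 0.7))` is smaller than `entropy(softmax(logits, 1.0))` for any non-uniform `logits`.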


Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

Yuming Zhou, Haoyang Li, Sheng Lin, Yanfeng Zhao 2026-06-14

ForeMoE addresses expert load imbalance in Mixture-of-Experts (MoE) models during reinforcement learning (RL) post-training, where existing step-level statistics fail due to high-frequency micro-step fluctuations. The method exploits foreseeable routing information from the rollout stage to proactively guide load balancing, using a hierarchical planner to decompose the NP-hard problem and a transfer engine for overlapped expert transfer. Evaluations on 64 GPUs show up to a 1.45× speedup over state-of-the-art RL post-training systems. This matters because it enables efficient scaling of MoE LLMs under the unique workload dynamics of RL post-training, a dominant paradigm in current LLM development.
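The benefit of knowing routing in advance can be illustrated with a small Python sketch. It assumes per-expert token counts for the next micro-step are already known from the rollout stage, and it places experts on GPUs with a greedy longest-processing-time heuristic. This heuristic is a stand-in chosen for brevity, not ForeMoE's hierarchical planner, and the names `place_experts` and `gpu_loads` are hypothetical.

```python
import heapq


def place_experts(expert_loads, num_gpus):
    """Greedy placement: heaviest expert first, always onto the lightest GPU.

    expert_loads maps expert id -> token count foreseen for the micro-step.
    Returns a dict mapping expert id -> GPU index.
    """
    heap = [(0, gpu) for gpu in range(num_gpus)]  # (current load, gpu index)
    heapq.heapify(heap)
    placement = {}
    # Sort by descending load; break ties by expert id for determinism.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        gpu_load, gpu = heapq.heappop(heap)
        placement[expert] = gpu
        heapq.heappush(heap, (gpu_load + load, gpu))
    return placement


def gpu_loads(expert_loads, placement, num_gpus):
    """Total foreseen tokens per GPU under a given placement."""
    loads = [0] * num_gpus
    for expert, gpu in placement.items():
        loads[gpu] += expert_loads[expert]
    return loads


foreseen = {0: 90, 1: 60, 2: 50, 3: 40, 4: 30, 5: 30}
plan = place_experts(foreseen, num_gpus=2)
balanced = gpu_loads(foreseen, plan, num_gpus=2)
```

A real system would also have to account for the cost of moving expert weights between GPUs, which the summary says ForeMoE hides by overlapping transfers; the sketch ignores that cost.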


MiniMax Sparse Attention

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen 2026-06-14

Quadratic-cost softmax attention makes ultra-long-context LLM inference untenable at deployment scale. MiniMax Sparse Attention (MSA) addresses this with a lightweight Index Branch that performs blockwise Top-k selection per GQA group and a Main Branch that computes exact block-sparse attention over the selected blocks, co-designed with an exp-free GPU kernel. On a 109B multimodal model, MSA reduces per-token attention compute by 28.4× at 1M context and achieves 14.2× prefill and 7.6× decoding speedups on H800. This matters because it enables practical deployment of frontier LLMs with million-token contexts for agentic workflows and repository-scale reasoning.
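The two-stage pattern of selecting blocks and then attending exactly within them can be sketched for a single query in plain Python. The scoring rule used here (dot product with each block's mean key) is an assumption, since the summary does not describe how the Index Branch scores blocks. The sketch also uses an ordinary softmax with `exp`, ignores GQA grouping, and does not model the exp-free kernel.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query, keys, block_size, top_k):
    """Score each key block by its mean key and return the top_k block indices."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start // block_size))
    # Highest score first; ties go to the earlier block.
    best = sorted(scores, key=lambda s: (-s[0], s[1]))[:top_k]
    return sorted(idx for _, idx in best)


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Exact softmax attention restricted to the positions in the selected blocks."""
    blocks = select_blocks(query, keys, block_size, top_k)
    positions = [
        p
        for b in blocks
        for p in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = math.sqrt(len(query))
    logits = [dot(query, keys[p]) / scale for p in positions]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(dim)
    ]


query = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[1.0], [1.0], [2.0], [2.0], [3.0], [3.0]]
chosen = select_blocks(query, keys, block_size=2, top_k=1)
output = block_sparse_attention(query, keys, values, block_size=2, top_k=1)
```

With `top_k` equal to the number of blocks, the function reduces to dense attention, which makes it easy to compare sparse and dense outputs on small inputs.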
