MiniMax Sparse Attention
Quadratic-cost softmax attention makes ultra-long-context LLM inference impractical at deployment scale. MiniMax Sparse Attention (MSA) addresses this with two branches: a lightweight Index Branch that performs blockwise Top-k selection per GQA group, and a Main Branch that computes exact block-sparse attention over the selected blocks. The method is co-designed with an exp-free GPU kernel. On a 109B multimodal model, MSA reduces per-token attention compute by 28.4x at 1M context and achieves 14.2x prefill and 7.6x decoding speedups on H800. These gains make it practical to deploy frontier LLMs with million-token contexts for agentic workflows and repository-scale reasoning.
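The two-branch idea can be illustrated with a minimal NumPy sketch for a single decoding step of one GQA group. This is not the MSA implementation. The block scoring rule (mean-pooled keys, scores summed over the query heads in the group) is an assumption chosen for illustration, the function name `block_sparse_attention` and its parameters are hypothetical, and the exp-free GPU kernel is not modeled.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, top_k):
    """Illustrative sketch only (hypothetical helper, not the MSA kernel).

    q: [G, d] query heads that share one KV head (one GQA group)
    k, v: [T, d] cached keys and values, with T divisible by block_size
    Returns (output [G, d], sorted indices of the selected blocks).
    """
    T, d = k.shape
    assert T % block_size == 0, "sketch assumes T is a multiple of block_size"
    n_blocks = T // block_size
    top_k = max(1, min(top_k, n_blocks))

    # Index branch (ASSUMED scoring rule): summarize each block by its mean
    # key, score it against every query head, and sum over the group so that
    # all heads in the group share one block selection.
    k_blocks = k.reshape(n_blocks, block_size, d).mean(axis=1)   # [n_blocks, d]
    block_scores = (q @ k_blocks.T).sum(axis=0)                  # [n_blocks]
    selected = np.sort(np.argpartition(-block_scores, top_k - 1)[:top_k])

    # Main branch: exact softmax attention restricted to the tokens that
    # belong to the selected blocks.
    token_idx = (selected[:, None] * block_size + np.arange(block_size)).reshape(-1)
    k_sel, v_sel = k[token_idx], v[token_idx]
    logits = (q @ k_sel.T) / np.sqrt(d)                          # [G, top_k * block_size]
    logits = logits - logits.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_sel, selected

# Usage: 4 query heads in the group, 32 cached tokens, blocks of 4 tokens,
# attending to only the 2 highest-scoring blocks.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((32, 8))
v = rng.standard_normal((32, 8))
out, selected = block_sparse_attention(q, k, v, block_size=4, top_k=2)
```

In this sketch the main branch touches only `top_k * block_size` tokens per query instead of all `T`, which is where the per-token compute saving comes from. When `top_k` equals the number of blocks, the result coincides with dense softmax attention.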