Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao 2026-06-14

Arbor addresses the problem of autonomous optimization in large, stateful action spaces by introducing a multi-agent framework with structured tree search as a shared cognition layer. The method pairs an Orchestrator agent with a Critic agent in a checks-and-balances architecture, using an explicit search tree of scored hypotheses as working memory. Experimental evidence shows Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% and crashes within hours. This matters because it enables fully autonomous, hardware-agnostic, and reproducible multi-day optimization campaigns across the full LLM inference stack.

PDF

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim 2026-06-14

Problem: Large reasoning models (LRMs) incur high inference costs due to long reasoning traces, and directly applying NVFP4 low-precision quantization degrades reasoning accuracy while existing kernels fail to deliver latency benefits in small-batch autoregressive decoding. Method: ReSET proposes a step-aware temperature scaling method that estimates step-level uncertainty online using both token-level and step-level entropy signals, and introduces a CUDA-core small-M NVFP4 kernel for latency-critical decoding. Finding: ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline, and the custom kernel achieves up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Why it matters: This work enables accurate and efficient low-precision inference for latency-critical LRM deployments, reducing computational and memory costs without sacrificing reasoning quality.

PDF

ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

ITME addresses the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond individual server capacity. The method leverages CXL-hybrid memory to provide massive, byte-addressable remote memory expansion, simplifying the software stack by eliminating complex software-level optimization. Experimental evidence from production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, along with an FPGA prototype, shows up to a 35.7% throughput improvement over conventional CPU-offloading. This matters because ITME enables cost-efficient scaling of shared context layers for agentic and long-context LLMs by proactively managing data movement across the memory-storage hierarchy.

PDF

MiniMax Sparse Attention

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen 2026-06-14

The problem is that quadratic-cost softmax attention makes ultra-long-context LLM inference untenable at deployment scale. The method, MiniMax Sparse Attention (MSA), uses a lightweight Index Branch for blockwise Top-k selection per GQA group and a Main Branch for exact block-sparse attention, co-designed with an exp-free GPU kernel. On a 109B multimodal model, MSA reduces per-token attention compute by 28.4x at 1M context and achieves 14.2x prefill and 7.6x decoding speedups on H800. This matters because it enables practical deployment of frontier LLMs with million-token contexts for agentic workflows and repository-scale reasoning.

PDF