Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao 2026-06-14

Arbor addresses autonomous optimization in large, stateful action spaces with a multi-agent framework that uses structured tree search as a shared cognition layer. The method pairs an Orchestrator agent with a Critic agent in a checks-and-balances architecture, using an explicit search tree of scored hypotheses as working memory. Experiments show Arbor achieves up to a 193% throughput-latency Pareto improvement for inference over vendor-optimized baselines, while a single agent without the harness plateaus at +33% and crashes within hours. This matters because it enables fully autonomous, hardware-agnostic, and reproducible multi-day optimization campaigns across the full LLM inference stack.
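To make the idea of a search tree of scored hypotheses concrete, here is a minimal sketch. It is not Arbor's implementation: the `Node` fields, the `expand` and `best_open_leaf` helpers, and the toy critic rule (reject any child that scores below its parent) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One hypothesis in the search tree (hypothetical structure)."""
    hypothesis: str
    score: float
    children: List["Node"] = field(default_factory=list)
    rejected: bool = False


def expand(parent: Node, hypothesis: str, score: float,
           critic: Callable[[Node, Node], bool]) -> Node:
    """Orchestrator step: attach a new scored hypothesis under `parent`.

    The critic sees both nodes and may veto the child; a vetoed child stays
    in the tree as a record but is never selected again.
    """
    child = Node(hypothesis, score)
    child.rejected = not critic(parent, child)
    parent.children.append(child)
    return child


def best_open_leaf(root: Node) -> Optional[Node]:
    """Return the highest-scoring leaf that is not inside a rejected subtree."""
    best: Optional[Node] = None
    stack = [root]
    while stack:
        node = stack.pop()
        if node.rejected:
            continue  # prune the whole rejected subtree
        if not node.children and (best is None or node.score > best.score):
            best = node
        stack.extend(node.children)
    return best


def no_regression(parent: Node, child: Node) -> bool:
    """Toy critic (assumption): accept only children that do not score worse."""
    return child.score >= parent.score


root = Node("baseline config", 1.0)
expand(root, "larger batch size", 1.3, no_regression)
expand(root, "smaller batch size", 0.8, no_regression)
next_node = best_open_leaf(root)
```

Because the tree is explicit, the next step can always be chosen from recorded scores rather than from an agent's conversational context, which is one plausible reading of "working memory" in the summary above.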

ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn 2026-06-14

ITME addresses the problem of scaling shared context infrastructure for TB-scale LLM inference workloads beyond individual server capacity. The method leverages CXL-hybrid memory to provide massive, byte-addressable remote memory expansion, simplifying the software stack by eliminating complex software-level optimization. Experimental evidence from production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, along with an FPGA prototype, shows up to a 35.7% throughput improvement over conventional CPU-offloading. This matters because ITME enables cost-efficient scaling of shared context layers for agentic and long-context LLMs by proactively managing data movement across the memory-storage hierarchy.

MiniMax Sparse Attention

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen 2026-06-14

The problem is that quadratic-cost softmax attention makes ultra-long-context LLM inference untenable at deployment scale. The method, MiniMax Sparse Attention (MSA), uses a lightweight Index Branch for blockwise Top-k selection per GQA group and a Main Branch for exact block-sparse attention, co-designed with an exp-free GPU kernel. On a 109B multimodal model, MSA reduces per-token attention compute by 28.4x at 1M context and achieves 14.2x prefill and 7.6x decoding speedups on H800. This matters because it enables practical deployment of frontier LLMs with million-token contexts for agentic workflows and repository-scale reasoning.
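The two-branch structure can be sketched in plain Python for a single query. This is an illustration under stated assumptions, not the MSA kernel: scoring each key block by the dot product of the query with the block's mean key is an assumed stand-in for the Index Branch, GQA grouping is omitted, and the ordinary `exp`-based softmax here does not reproduce the exp-free kernel mentioned above. The function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(q, keys, block_size, k):
    """Index step (assumed scoring rule): rank key blocks by q . mean(block keys)
    and return the indices of the top-k blocks in ascending order."""
    n_blocks = math.ceil(len(keys) / block_size)
    scored = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(q, mean_key), b))
    top = sorted(scored, reverse=True)[:k]
    return sorted(b for _, b in top)


def block_sparse_attention(q, keys, values, block_size, k):
    """Main step: exact softmax attention restricted to the selected blocks."""
    token_ids = [
        i
        for b in select_blocks(q, keys, block_size, k)
        for i in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in token_ids]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][c] for w, i in zip(weights, token_ids)) / z
        for c in range(dim)
    ]


keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0]]
values = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
q = [1, 0]
sparse_out = block_sparse_attention(q, keys, values, block_size=2, k=1)
```

With `k` equal to the number of blocks, the function reduces to dense attention; smaller `k` cuts the attention work roughly in proportion to the fraction of blocks kept.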

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Jinghao Wang, Xiao Zhou, Xiaoyang Sun, Yihui Zhang 2026-06-14

Maestro addresses the problem of high resource consumption and scheduling inefficiencies in deploying LLM-based multi-agent systems under strict GPU budgets. The method uses agent semantics to predict output length and memory usage, enabling hierarchical scheduling with dynamic model co-location, latency-aware routing, and workflow-aware prioritization. Experimental evidence shows Maestro reduces KV-reservation HBM by 67.2% and, under high contention, improves SLO attainment by 23.6 percentage points over earliest-deadline-first (EDF) scheduling. This matters because it enables efficient, scalable deployment of complex multi-agent workflows in resource-constrained cloud environments.
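A short worked example shows why predicting output length can shrink KV-cache reservations. It uses the standard per-token KV-cache size for a transformer (keys plus values, per layer, per KV head). The model shape, token counts, and function names below are hypothetical and are not taken from Maestro; the resulting saving is illustrative only and is not the paper's figure.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache bytes for one token: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def reservation_bytes(prompt_tokens, output_tokens, per_token):
    """HBM reserved for a request that may grow to prompt + output tokens."""
    return (prompt_tokens + output_tokens) * per_token


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

# Hypothetical request: 1000 prompt tokens, 4096-token output cap,
# but a predictor estimates only 512 output tokens for this agent role.
worst_case = reservation_bytes(1000, 4096, per_token)
predicted = reservation_bytes(1000, 512, per_token)
saving = 1 - predicted / worst_case
print(f"per-token KV: {per_token} bytes, saving: {saving:.1%}")
```

A real scheduler would also need a fallback for requests that outrun their predicted length, for example by growing the reservation or preempting; that part is outside this sketch.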

An LLM System for Autonomous Variational Quantum Circuit Design

Kenya Sakka, Wataru Mizukami, Kosuke Mitarai 2026-06-14

The problem is that designing high-performing quantum circuits remains heavily reliant on human expertise. The method introduces an autonomous agentic framework using LLMs with seven integrated components for iterative circuit design under explicit constraints. Experimental evidence shows the framework outperforms representative quantum feature maps on image classification and achieves competitive accuracy for molecular ground state estimation across seven molecules. This matters because it establishes LLM-driven agentic systems as a viable paradigm for automated quantum circuit design and demonstrates AI's role in iterative scientific optimization.
