Daily | Yixun Hong

AgenticCANN: Automated Ascend C Operator Generation via Knowledge-Augmented Agentic Evolution

Junhao Qiu, Zidong Wang, Yansong Sun, Zhitong Ma 2026-07-30

Inference × Language Model LLM × Hardware Agentic

AgenticCANN targets automated Ascend C operator generation for NPUs, a task that requires deep hardware expertise and suffers from a severe shortage of platform knowledge. The method is a knowledge-augmented agentic evolution framework that pairs a knowledge-orchestrated generation system with a stage-adaptive agentic evolution strategy, which dynamically aligns LLM interaction modes. Experiments on Huawei Ascend 910B across six operators show 90-100% feasibility on elementwise and normalization operators, 56% on fusion operators, and up to 6.65× speedup on 1B Pangu model inference kernels; knowledge injection improves feasibility monotonically from 57% to 86%. This matters because it enables automated, high-performance operator synthesis in low-corpus NPU environments despite the unique challenges of the Ascend C programming model.

DualDecoder: Accelerate Long Context LLM Inference by Predictive Prefetch

Zuning Liang, Zhiyi Yao, Qi Chen, Yuedong Xu 2026-07-30

Prefetching Microarchitecture Simulation LLM × Inference ×

DualDecoder addresses the memory bottleneck in long-context LLM inference caused by the growing KV cache. It predicts critical KV entries for the next token from the preceding speculated token, enabling proactive prefetching from host memory. Experiments show up to 2.62× throughput improvement over state-of-the-art systems without degrading latency or model quality. This matters because it eliminates GPU memory overhead from auxiliary states, making high-concurrency long-context inference more efficient for agentic applications.
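
The idea of predicting which KV entries to prefetch can be illustrated with a minimal sketch. It assumes that the attention pattern of the preceding token is a usable proxy for the next one, and it ranks cached entries by softmax attention weight. The helper names (`attention_weights`, `predict_critical_entries`, `prefetch_hit_rate`), the `budget` parameter, and the toy vectors are all hypothetical; this is not DualDecoder's actual predictor.

```python
import math


def attention_weights(query, keys):
    """Softmax over scaled dot products between one query and all cached keys."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_critical_entries(query, keys, budget):
    """Indices of the `budget` KV entries with the highest attention weight."""
    weights = attention_weights(query, keys)
    ranked = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:budget])


def prefetch_hit_rate(prefetched, next_query, keys, budget):
    """Fraction of the next step's true top-`budget` entries already prefetched."""
    actual = predict_critical_entries(next_query, keys, budget)
    return len(set(prefetched) & set(actual)) / budget


# Toy KV cache with four 2-dimensional keys (made-up values).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
prev_query = [1.0, 0.2]   # query of the preceding token
next_query = [0.9, 0.3]   # query of the token actually decoded next

# Entries chosen for prefetch from host memory, based on the preceding token.
prefetched = predict_critical_entries(prev_query, keys, budget=2)
hit_rate = prefetch_hit_rate(prefetched, next_query, keys, budget=2)
```

In this toy case the two queries are similar, so the prefetched set covers everything the next step needs. A real system would also have to handle mispredictions, for example by fetching missing entries on demand.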

Beyond Prefill-Decode Disaggregation: Dissecting LLM Inference for Heterogeneous Platforms via Dynamic Operator Scheduling

Jiaqi Yang, Jiayi Li, Yihan Fu, Hongxiao Zhao 2026-07-30

Hardware Roofline Codesign Compiler Runtime HPC LLM × Inference ×

Prefill-decode (PD) disaggregation and roofline-based operator placement are insufficient for partitioning LLM inference across heterogeneous systems because of workload shape, device contention, and weight layout constraints. DOPS is a hardware-aware, closed-loop framework that combines a stage-aware DAG, a Bifocal scheduler for dynamic operator placement, and a Weight Layout Arbiter (WLA) that selects efficient weight layouts. On heterogeneous platforms combining NPUs and PIM devices, Bifocal achieves geometric-mean speedups of 1.20× to 2.23× over the PD baseline, and WLA adds a further 1.28× to 1.33×. This matters because DOPS enables systematic analysis of workload sensitivity and hardware scalability, improving LLM serving efficiency on diverse heterogeneous hardware.
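
For context, the roofline-based placement that the summary describes as insufficient can be sketched in a few lines, along with the geometric mean used to aggregate speedups. The `Device` class, the function names, and all hardware numbers below are made-up assumptions for illustration; none of this is DOPS's scheduler. The sketch places each operator on the device with the highest roofline bound, which ignores contention and weight layout.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    peak_flops: float  # peak compute throughput in FLOP/s
    bandwidth: float   # memory bandwidth in bytes/s


def attainable_flops(device, intensity):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(device.peak_flops, device.bandwidth * intensity)


def place_by_roofline(intensity, devices):
    """Pick the device with the highest roofline bound for this operator."""
    return max(devices, key=lambda d: attainable_flops(d, intensity))


def geometric_mean(values):
    """Geometric mean of positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical devices: a compute-rich NPU and a bandwidth-rich PIM unit.
npu = Device("npu", peak_flops=100e12, bandwidth=1e12)
pim = Device("pim", peak_flops=4e12, bandwidth=8e12)

# Arithmetic intensity in FLOP/byte: low for a decode-style GEMV,
# high for a prefill-style GEMM (illustrative values).
decode_choice = place_by_roofline(1.0, [npu, pim])
prefill_choice = place_by_roofline(200.0, [npu, pim])

# Aggregating per-workload speedups (made-up numbers) with a geometric mean.
mean_speedup = geometric_mean([1.0, 4.0])
```

Under these assumed numbers the memory-bound operator lands on the PIM device and the compute-bound operator on the NPU. A static rule like this sees only peak throughput and bandwidth, so it cannot react to workload shape or device contention.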
