Daily | Yixun Hong

Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

Jiading Gai, Shuai Zhang, Kaj Bostrom, Jin Huang 2026-06-28

Triton, CUDA, Roofline, Codesign, Compiler, Runtime, HPC, LLM

KernelPro tackles automated GPU kernel optimization with a closed-loop multi-agent system that combines LLM code generation with hardware profiler feedback and pluggable micro-profiling tools. The method uses a two-stage tool invocation architecture with roofline-based bottleneck classification, a domain-adapted MCTS (Monte Carlo tree search) procedure, and direct source-level CuTe code generation from the CUTLASS/CuTe codebase. On KernelBench, KernelPro reports geometric mean speedups of 2.42x, 4.69x, and 5.30x on Levels 1, 2, and 3 respectively, and a 1.23x improvement over hand-tuned Triton on VeOmni's MoE kernels; ablation studies attribute a significant contribution to each design component. The authors also present KernelPro as the first CUDA kernel coding agent to optimize energy efficiency in addition to speed, with an 11.6% measured energy reduction at matched speed, and claim state-of-the-art performance across all three difficulty levels.
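For readers unfamiliar with the two quantitative ideas in the summary, here is a minimal Python sketch of a textbook roofline classification and of a geometric mean over per-kernel speedups. It is not KernelPro's classifier, which may use more categories and real profiler counters. The function names, the peak throughput and bandwidth figures, and the example speedups are all hypothetical, chosen only for illustration.

```python
import math


def classify_bottleneck(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Textbook roofline test (an assumption, not the paper's exact rule).

    A kernel whose arithmetic intensity (FLOP per byte) falls below the
    ridge point peak_flops / peak_bandwidth is limited by memory traffic;
    otherwise it is limited by compute throughput.
    """
    intensity = flops / bytes_moved
    ridge_point = peak_flops / peak_bandwidth
    return "memory-bound" if intensity < ridge_point else "compute-bound"


def geometric_mean(speedups):
    """Geometric mean, the usual way to aggregate per-kernel speedups."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s peak bandwidth.
PEAK_FLOPS = 10e12
PEAK_BANDWIDTH = 1e12

# float32 vector add: 1 FLOP per element, 12 bytes moved (2 loads, 1 store).
n = 1 << 20
vector_add = classify_bottleneck(n, 12 * n, PEAK_FLOPS, PEAK_BANDWIDTH)

# float32 square GEMM: 2*m^3 FLOPs, three m x m matrices moved once each.
m = 4096
gemm = classify_bottleneck(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BANDWIDTH)

# Made-up per-kernel speedups, not results from the paper.
summary = geometric_mean([2.0, 8.0])
```

Under these assumed peaks the vector add lands below the ridge point and the GEMM above it, which is the kind of signal a bottleneck classifier can use to decide whether memory-access or compute optimizations are worth trying first. The geometric mean is preferred over the arithmetic mean for speedups because a single very large ratio does not dominate the aggregate.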
