Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation
Developing high-performance NPU kernels is a critical bottleneck because it requires manually navigating implicit hardware constraints. Hawk is a training-free framework whose three modules harness hardware-aware knowledge to generate correct and efficient kernels. In experiments on real-world NPU workloads, Hawk improves generation accuracy from 49.4% to 80.0% and achieves up to a 2.2x execution speedup over state-of-the-art baselines. The result is automated, high-performance kernel generation for NPUs in cases where existing LLM-based approaches fail.