Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation

Junyi Wen, Ruiyan Zhuang, Yongjia Xu, Pengtu Li 2026-07-03

Developing high-performance NPU kernels is a critical bottleneck because it requires manually navigating implicit hardware constraints. Hawk is a training-free framework whose three modules harness hardware-aware knowledge to generate correct and efficient kernels. On real-world NPU workloads, Hawk improves generation accuracy from 49.4% to 80.0% and achieves up to a 2.2x execution speedup over state-of-the-art baselines. This matters because it enables automated, high-performance kernel generation for NPUs, overcoming the failures of existing LLM-based approaches.

OmniPilot: An Uncertainty-Aware LLM Inference Advisor for Heterogeneous GPU Clusters

D. Balamurugan, Thomas W. Bush 2026-07-03

OmniPilot addresses the challenge of selecting GPU type, tensor-parallel degree, and precision for LLM serving on heterogeneous clusters, where static recipes fail due to fluctuating throughput, launch success, and hardware failures. The method combines a conformally calibrated quantile cost model for eight serving targets with an out-of-distribution (OOD) abstention layer, ranking configurations by an economic utility metric. Across 460 benchmark runs on A100, H100, and H200 hardware, OmniPilot achieves 6.2% MAPE for aggregate throughput, 95% top-1 accuracy, and 0.003 mean utility regret, while the abstention layer correctly flags all OOD points. This matters because it enables robust, uncertainty-aware configuration decisions that adapt to dynamic cluster conditions and expand the advisor's support envelope over time.
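
The conformal calibration step can be illustrated with a minimal sketch. It assumes the split-conformal recipe for quantile regression: score each held-out run by how far the measured value falls outside the predicted band, then widen every new band by a finite-sample quantile of those scores. The function names and the numbers in the usage lines are hypothetical; the summary does not say which conformal variant OmniPilot uses.

```python
import math


def conformal_margin(lo, hi, y, alpha):
    """Margin that widens predicted [lo, hi] bands to roughly 1 - alpha coverage.

    lo, hi: predicted lower/upper quantiles on a held-out calibration set.
    y:      measured values for the same calibration runs.
    """
    # Conformity score: positive when the measurement falls outside the band.
    scores = sorted(max(l - t, t - h) for l, h, t in zip(lo, hi, y))
    n = len(scores)
    # Finite-sample rank used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def calibrated_interval(lo, hi, margin):
    """Apply the calibration margin to a new predicted band."""
    return lo - margin, hi + margin


# Hypothetical throughput predictions (tokens/s) for nine calibration runs.
cal_lo = [10] * 9
cal_hi = [20] * 9
cal_y = [15, 15, 15, 15, 21, 22, 23, 24, 25]
margin = conformal_margin(cal_lo, cal_hi, cal_y, alpha=0.5)
band = calibrated_interval(10, 20, margin)
```

An advisor built this way could rank configurations by the calibrated lower bound rather than the point estimate, which is one plausible reading of "uncertainty-aware"; that ranking rule is an assumption here, not something the summary states.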

High-Performance NTT Accelerators for PQC leveraging Unified Redundant Arithmetic and Fine-Tuned Microarchitecture

George Alexakis, Dimitrios Schoinianakis, Giorgos Dimitrakopoulos 2026-07-03

The paper addresses the performance bottleneck of modular reduction and scaling overhead in NTT/INTT accelerators for lattice-based PQC schemes like ML-KEM and ML-DSA. The authors propose parallel iterative NTT/INTT accelerators using optimized unified butterfly units with a novel redundant number representation that eliminates conditional corrections and integrates inverse-transform scaling into existing hardware. FPGA-based experiments demonstrate higher clock frequencies, reduced execution times, and competitive resource utilization compared to prior designs. This matters because it enables more efficient polynomial arithmetic for post-quantum cryptography and privacy-preserving applications, critical for future secure communication systems.
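
For readers unfamiliar with the operations involved, the following sketch shows a textbook radix-2 butterfly, a forward NTT, and an inverse NTT with its final scaling by n⁻¹. It is a plain cyclic transform in software, not the authors' redundant-arithmetic hardware design. The constants assume the ML-KEM modulus q = 3329 with 17 as a primitive 256th root of unity; ML-KEM itself uses a negacyclic, incomplete variant that is not reproduced here.

```python
Q = 3329      # ML-KEM modulus
N = 256       # polynomial length
OMEGA = 17    # primitive 256th root of unity modulo Q


def ct_butterfly(a, b, w, q=Q):
    """Cooley-Tukey butterfly: returns (a + w*b, a - w*b) modulo q."""
    t = (w * b) % q
    # Each % is the modular reduction step that hardware designs try to make cheap.
    return (a + t) % q, (a - t) % q


def ntt(values, omega, q=Q):
    """Recursive radix-2 NTT; len(values) must be a power of two
    and omega a primitive root of unity of that order."""
    n = len(values)
    if n == 1:
        return list(values)
    omega_sq = omega * omega % q
    even = ntt(values[0::2], omega_sq, q)
    odd = ntt(values[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        out[k], out[k + n // 2] = ct_butterfly(even[k], odd[k], w, q)
        w = w * omega % q
    return out


def intt(values, omega, q=Q):
    """Inverse NTT: forward transform with omega^-1, then scale by n^-1."""
    n_inv = pow(len(values), -1, q)
    transformed = ntt(values, pow(omega, -1, q), q)
    # This final multiplication is the scaling overhead the summary refers to.
    return [(x * n_inv) % q for x in transformed]


coeffs = [(7 * i + 3) % Q for i in range(N)]
roundtrip = intt(ntt(coeffs, OMEGA), OMEGA)
```

The explicit `% q` reductions and the separate scaling pass in `intt` correspond to the two costs the paper targets: conditional corrections in modular reduction, and the extra multiplication by n⁻¹.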

Scalable and Distributed Silhouette Approximation

Ilie Sarpe, Federico Altieri, Andrea Pietracaprina, Geppino Pucci 2026-07-03

Exact silhouette computation for a k-clustering requires Θ(n²) distance calculations, which is prohibitive for massive datasets, and existing approximate methods lack provable guarantees. The paper introduces rigorous sampling-based algorithms that perform O(nkε⁻² ln(nk/δ)) distance computations to estimate both local and global silhouette values. Experiments against state-of-the-art approaches show that the new techniques achieve the best trade-off between accuracy and efficiency and scale to massive datasets where exact computation is impractical. This matters because it provides the first provably accurate and efficient silhouette approximation, enabling scalable clustering quality assessment in distributed frameworks like MapReduce and MPC.
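
The idea of trading exactness for fewer distance computations can be sketched as follows. This illustration assumes plain uniform sampling with a fixed per-cluster sample size. It is not the authors' algorithm, whose sampling scheme and (ε, δ) guarantees are not described in the summary, and all function and parameter names are hypothetical.

```python
import math
import random
from collections import defaultdict


def group_by_cluster(labels):
    """Map each cluster label to the list of point indices it contains."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    return clusters


def silhouette(points, labels, i, sample_size=None, rng=None):
    """Silhouette of point i (requires at least two clusters).

    With sample_size=None the value is exact. Otherwise the mean distance
    to each cluster is estimated from at most sample_size uniformly
    sampled members, cutting the number of distance computations.
    """
    rng = rng or random.Random(0)
    clusters = group_by_cluster(labels)
    own = labels[i]
    if len(clusters[own]) == 1:
        return 0.0  # usual convention for singleton clusters
    mean_to = {}
    for lab, members in clusters.items():
        candidates = [j for j in members if j != i]
        if sample_size is not None and len(candidates) > sample_size:
            candidates = rng.sample(candidates, sample_size)
        total = sum(math.dist(points[i], points[j]) for j in candidates)
        mean_to[lab] = total / len(candidates)
    a = mean_to[own]                                        # cohesion
    b = min(v for lab, v in mean_to.items() if lab != own)  # separation
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)


def mean_silhouette(points, labels, sample_size=None, rng=None):
    """Global silhouette: the average of the per-point values."""
    rng = rng or random.Random(0)
    total = sum(silhouette(points, labels, i, sample_size, rng)
                for i in range(len(points)))
    return total / len(points)
```

With `sample_size=t`, each point costs at most `k * t` distance computations instead of `n - 1`, which is where the linear dependence on n and k in the stated bound comes from. Choosing t to obtain a provable error guarantee is the part this sketch leaves out.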

Cadence: Extreme Pipelining with Multiple Concurrent Proposers

Fatima Elsheimy, Mohammad Mussadiq Jalalzai, Tobias Klenze, Jovan Komatovic 2026-07-03

Cadence introduces a Byzantine fault-tolerant multi-proposer consensus protocol achieving arbitrarily low block intervals through extreme pipelining, which decouples block intervals from network latency by running independent consensus instances per slot. The method employs multiple concurrent proposers (MCP) with a general framework converting one-shot slot consensus into multi-shot protocols, instantiated via Chorus (optimal three-round fast path) and Conductor (adaptive slot pacing). In simulation over Monad's 200 validators with five proposers per slot, finalization averages 219 ms (167 ms to speculative finality), with a 50 ms average transaction wait at a 100 ms block interval. This matters because Cadence is the first MCP protocol to provide short-term censorship resistance and hiding at the fast-path latency of single-leader consensus, removing the single-leader monopoly on transaction ordering.

Kani: A Model Checker for Rust

Rémi Delmas, Zyad Hassan, Qinheping Hu, Rahul Kumar 2026-07-03

Kani addresses the problem of verifying safety properties in Rust beyond the guarantees of its ownership type system, including unsafe operations, functional correctness, and absence of panics. The method compiles proof harnesses from Rust's MIR into CBMC's bit-precise engine for bounded model checking, and extends to unbounded verification via a specification language with contracts, quantifiers, and stubbing. Experimental evidence from case studies on industrial Rust projects shows that contracts upgraded verification from panic-freedom to functional correctness, uncovering six previously unknown bugs, with over 16,000 harnesses verified per code change in the Rust standard library. This matters because Kani provides correctness guarantees at scale in production CI, moving beyond bug-finding to enable formal verification of critical Rust code.

Mixture-of-Parallelisms: Towards Memory-Efficient Training Stack for Mixture-of-Experts Models

Xuan-Phi Nguyen, Shrey Pandit, Yiran Zhao, Semih Yavuz 2026-07-03

The paper addresses the memory bottleneck in training Mixture-of-Experts (MoE) models by introducing Mixture-of-Parallelisms (MoP), a training stack that combines and specializes parallelism techniques across different layers and stages of the pipeline. MoP includes a novel optimizer strategy to maximize throughput and memory efficiency under hardware constraints. Experiments show MoP achieves 4.7x–8.2x higher per-GPU throughput than a tuned FSDP2 baseline, with the gap widening at larger scales, and sustains training at 1M context length where the baseline fails beyond 64–128K. This matters because it enables lossless pre-training and fine-tuning of trillion-parameter models at million-token context lengths using only 12 nodes of 8 H200 GPUs each, dramatically lowering the hardware barrier for large-scale MoE model training.

Markovian Arrival Process Parameter Estimation of Quasi-birth-death Queueing Systems with Utilization Data

Chen Li, Junjun Zheng, Hiroyuki Okamura, Tadashi Dohi 2026-07-03

The paper addresses the problem of estimating queueing system parameters when only utilization data (e.g., CPU busy fraction) is available, not detailed event logs. The authors propose an expectation-maximization (EM) algorithm for Markovian arrival process (MAP)-driven quasi-birth-death (QBD) systems, deriving expected sufficient statistics from observable and unobservable intervals. The abstract does not disclose experimental results. This matters because it enables practical queueing model parameter estimation in computer systems where detailed measurements are infeasible, using only easily collected utilization data.

FlintKV: A Fast Durable Storage Engine for Modern Databases

Sergey Egorov, Gregory Chockler, Brijesh Dongol, Dan O'Keeffe 2026-07-03

FlintKV addresses the problem that existing NVM-optimized key-value stores lack the full API support required by modern databases, such as point-in-time snapshots, consistent iterators, and atomic batches. It introduces a novel flat-combining-based concurrency control algorithm with multi-versioning and co-designed persistence mechanisms, built on an NVM-optimized skiplist. Experimental evidence shows FlintKV achieves up to a 75% improvement in end-to-end throughput over prior work while guaranteeing durable linearizability. This matters because it enables production-grade database functionality on NVM without sacrificing performance, bridging the gap between research prototypes and real-world storage engines.
