Fearless Concurrency on the GPU

Melih Elibol, Jared Roesch, Isaac Gelado, Eric Buehler (2026-06-16)

Writing custom GPU kernels in Rust currently forces programmers outside the language's ownership guarantees, which prevents safe systems programming on the GPU. cuTile Rust is a tile-based system that extends Rust's ownership discipline to GPU kernels by splitting mutable outputs into disjoint pieces and preserving host-side ownership contracts. On the NVIDIA B200 GPU, cuTile Rust achieves 7 TB/s for element-wise operations and 2 PFlop/s for GEMM (96% of cuBLAS). Its Grout inference engine reaches 171 tokens/s for Qwen3-4B on the RTX 5090 and 82 tokens/s for Qwen3-32B on the B200, competitive with vLLM and SGLang. Together, these results indicate that safe, idiomatic GPU kernel authoring in Rust is possible without sacrificing performance, which makes concurrent GPU programming both safer and more accessible.
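The idea of splitting a mutable output into disjoint pieces can be illustrated with a CPU-side analogy in standard Rust. The sketch below is not cuTile Rust code and does not use its API. The function name `tiled_add` and the `tile` parameter are hypothetical, and threads stand in for GPU thread blocks. It shows only the ownership pattern: each worker receives exclusive access to one non-overlapping output tile, so the compiler can rule out write-write races.

```rust
use std::thread;

/// Hypothetical CPU analogy (not the cuTile Rust API): element-wise add
/// where each worker exclusively owns one disjoint tile of the output.
fn tiled_add(a: &[f32], b: &[f32], out: &mut [f32], tile: usize) {
    assert!(tile > 0, "tile size must be non-zero");
    assert_eq!(a.len(), out.len());
    assert_eq!(b.len(), out.len());

    thread::scope(|s| {
        // `chunks_mut` yields non-overlapping `&mut [f32]` tiles, so the
        // borrow checker proves that no two workers write the same element.
        for (i, out_tile) in out.chunks_mut(tile).enumerate() {
            let start = i * tile;
            // Inputs are shared immutably; any number of readers is fine.
            let a_tile = &a[start..start + out_tile.len()];
            let b_tile = &b[start..start + out_tile.len()];
            // One thread per tile keeps the sketch short; a real system
            // would map tiles onto a fixed pool of workers instead.
            s.spawn(move || {
                for ((o, x), y) in out_tile.iter_mut().zip(a_tile).zip(b_tile) {
                    *o = x + y;
                }
            });
        }
        // All workers are joined when the scope ends, so the caller's
        // exclusive borrow of `out` is intact again on return.
    });
}
```

Calling `tiled_add(&a, &b, &mut out, 2)` on five-element inputs processes three tiles (sizes 2, 2, and 1). Attempting to hand the same tile to two workers would be rejected at compile time, which is the property the abstract describes carrying over to GPU kernels.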
