
On the Limits of Performance Portability in Directive-Based GPU Programming

Alessandro Romeo, Nitin Shukla, Stefano Truzzi, Alessio Suriano 2026-06-14

Exascale OpenMP Performance Portability HPC Compiler Runtime GPU

Directive-based GPU programming faces fundamental trade-offs between performance, portability, and productivity when scientific applications move to exascale systems. The authors port the production-grade magnetohydrodynamics code gPLUTO from OpenACC to OpenMP and evaluate it on NVIDIA A100 and AMD MI250X devices. OpenACC and OpenMP perform comparably on NVIDIA platforms, but the same OpenMP implementation runs approximately three times slower at the application level on the AMD MI250X, with kernel-level slowdowns of up to 47x attributed to strided memory-access patterns, compiler limitations, and register pressure from C++ abstractions. The results indicate that portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies.
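To make the "strided memory-access" point concrete, here is a minimal sketch of two OpenMP target loops that compute the same result but traverse memory differently. It is not taken from gPLUTO: the function names, the row-major layout, and the loop structure are all assumptions chosen for illustration, and how far apart the two variants land in practice depends on the compiler and the device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from gPLUTO): scale a row-major nx-by-ny field in place.
// The inner loop runs over i, the contiguous index, so neighbouring iterations
// touch neighbouring addresses (unit stride).
void scale_unit_stride(double* u, std::size_t nx, std::size_t ny, double a) {
    assert(u != nullptr);
    #pragma omp target teams distribute parallel for collapse(2) \
        map(tofrom: u[0:nx * ny])
    for (std::size_t j = 0; j < ny; ++j)
        for (std::size_t i = 0; i < nx; ++i)
            u[j * nx + i] *= a;
}

// Same arithmetic with the loops interchanged. The inner loop now runs over j,
// so neighbouring iterations touch addresses nx elements apart (stride nx).
void scale_strided(double* u, std::size_t nx, std::size_t ny, double a) {
    assert(u != nullptr);
    #pragma omp target teams distribute parallel for collapse(2) \
        map(tofrom: u[0:nx * ny])
    for (std::size_t i = 0; i < nx; ++i)
        for (std::size_t j = 0; j < ny; ++j)
            u[j * nx + i] *= a;
}
```

Both functions produce identical output, and without OpenMP offload support the pragmas are ignored and the loops run on the host. Only the order in which memory is visited differs, which is the kind of source-level detail a compiler backend may or may not be able to compensate for.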
