Daily | Yixun Hong

Filtered by: GPU × OpenMP × Clear all

On the Limits of Performance Portability in Directive-Based GPU Programming

Alessandro Romeo, Nitin Shukla, Stefano Truzzi, Alessio Suriano 2026-06-14

Exascale OpenMP × Performance Portability HPC Compiler Runtime GPU ×

The problem is that directive-based GPU programming faces fundamental trade-offs between performance, portability, and productivity when transitioning scientific applications to exascale systems. The method involved porting the production-grade magnetohydrodynamics code gPLUTO from OpenACC to OpenMP and evaluating its performance on NVIDIA A100 and AMD MI250X devices. Experimental evidence shows that while OpenACC and OpenMP achieve comparable performance on NVIDIA platforms, the same OpenMP implementation is approximately three times slower at the application level on AMD MI250X, with kernel-level slowdowns reaching up to 47x due to strided memory-access patterns, compiler limitations, and register pressure from C++ abstractions. This matters because it demonstrates that achieving portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies.

PDF

nomp: A Framework for Building Domain Specific Compilers

Thilina Ratnayaka, Kaushik Kulkarni, Nipuna Fernando, Pubudu Hewavitharana 2026-06-14

OpenMP × CUDA HPC Compiler Runtime

Problem: Existing GPU programming models force a trade-off between low-level performance and high-level productivity, with no single solution achieving all three goals of productivity, portability, and performance. Method: The authors propose nomp, a framework for building domain-specific compilers that uses a pragma-based programming model and a runtime for code transformation and generation based on user-provided metadata. Finding or experimental evidence: The abstract does not disclose experimental results. Why it matters: nomp aims to improve programmer productivity without sacrificing performance or portability by enabling reuse of domain-specific optimization patterns.

PDF