Daily | Yixun Hong

Filtered by: AI × Ph × Simulation × Clear all

Partitioned Tags, Shared Data: Reconciling Strict Cache Isolation with Write-Shared Coherence

Kartik Ramkrishnan, Stephen McCamant, Antonia Zhai, Pen Chung Yew 2026-06-14

Gem5 Microarchitecture Simulation × Cache

SCP solves the problem that write-shared coherence fails under strict cache partitioning, a decade-old barrier to deploying eviction-based side-channel defenses in secure shared-OS settings. The method partitions only the tags while sharing a single data pool, sizes the data pool to prevent capacity-driven cross-partition eviction, and routes writes to the LLC after a leakage threshold to mitigate coherence-based leakage. Experimental evidence from gem5 shows SCP mitigates Prime+Probe, Flush+Reload, and shared-writeable-line attacks to no better than random guessing, with a +2.8% LLC SRAM hardware cost and IPC within 0.3% of DAWG on SPEC CPU2017. This matters because SCP reconciles strict cache isolation with write-shared coherence, enabling secure partitioning without sacrificing performance or coherence correctness.

PDF

On the Limits of Performance Portability in Directive-Based GPU Programming

Alessandro Romeo, Nitin Shukla, Stefano Truzzi, Alessio Suriano 2026-06-14

Exascale OpenMP Performance Portability HPC Compiler Runtime GPU

The problem is that directive-based GPU programming faces fundamental trade-offs between performance, portability, and productivity when transitioning scientific applications to exascale systems. The method involved porting the production-grade magnetohydrodynamics code gPLUTO from OpenACC to OpenMP and evaluating its performance on NVIDIA A100 and AMD MI250X devices. Experimental evidence shows that while OpenACC and OpenMP achieve comparable performance on NVIDIA platforms, the same OpenMP implementation is approximately three times slower at the application level on AMD MI250X, with kernel-level slowdowns reaching up to 47x due to strided memory-access patterns, compiler limitations, and register pressure from C++ abstractions. This matters because it demonstrates that achieving portable performance across GPU architectures requires not only application-level changes but also continued advances in compiler backends and architecture-aware optimization strategies.

PDF