(searched for: doi:10.1145/3297858.3304055)
Published: 26 October 2021
Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles CD-ROM; https://doi.org/10.1145/3477132.3483545
Newer use cases of GPU (Graphics Processing Unit) computing, e.g., graph analytics, look less like traditional bulk-synchronous GPU programs. To cater to the needs of emerging applications with semantically richer and finer grain sharing patterns, GPU vendors have been introducing advanced programming features, e.g., scoped synchronization and independent thread scheduling. While these features can speed up many applications and enable newer use cases, they can also introduce subtle synchronization errors if used incorrectly. We present iGUARD, a runtime software tool to detect races in GPU programs due to incorrect use of such advanced features. A key need for a race detector to be practical is to accurately detect races at reasonable overheads. We thus perform the race detection on the GPU itself without relying on the CPU. The GPU's parallelism helps speed up race detection by 15x over a closely related prior work. Importantly, iGUARD detects newer types of races that were hitherto not possible for any known tool. It detected previously unknown subtle bugs in popular GPU programs, including three in NVIDIA supported commercial libraries. In total, iGUARD detected 57 races in 21 GPU programs, without false positives.
Published: 11 July 2021
Concurrency and Computation: Practice and Experience; https://doi.org/10.1002/cpe.6483
The publisher has not yet granted permission to display this abstract.
Published: 21 January 2021
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3429981
Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Unit (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) that can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism to eliminate nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. The cycle-accurate simulation indicates an average performance improvement of 21.8% and power reduction of up to 18.3% for stencil codes in General-Purpose Graphics Processing Unit (GPGPU) standard benchmark suites. We show that NeDa’s performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange.