Efficient Nearest-Neighbor Data Sharing in GPUs
- 30 December 2020
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Architecture and Code Optimization
- Vol. 18 (1), 1-26
- https://doi.org/10.1145/3429981
Abstract
Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) that can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism to eliminate nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. The cycle-accurate simulation indicates an average performance improvement of 21.8% and power reduction of up to 18.3% for stencil codes in General-Purpose Graphics Processing Unit (GPGPU) standard benchmark suites. We show that NeDa’s performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange.
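As an illustration of the access pattern the abstract describes, the following is a minimal sketch of a one-dimensional three-point stencil. It is not NeDa and not code from the article; the function names `stencil_3pt` and `read_counts`, the averaging weights, and the one-thread-per-point mapping are assumptions chosen for illustration. The second function counts how often each input element is read when every interior point is computed independently, which is the redundancy that nearest-neighbor sharing mechanisms aim to reduce.

```python
def stencil_3pt(grid):
    """Apply a 3-point averaging stencil (hypothetical weights).

    Each interior point becomes the mean of itself and its two
    nearest neighbors; boundary points are copied unchanged.
    """
    out = list(grid)
    for i in range(1, len(grid) - 1):
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def read_counts(n):
    """Count reads of each input element, assuming one thread per
    interior point and no data sharing between threads."""
    counts = [0] * n
    for i in range(1, n - 1):
        for j in (i - 1, i, i + 1):
            counts[j] += 1
    return counts


if __name__ == "__main__":
    print(stencil_3pt([0, 3, 6, 9, 12]))
    print(read_counts(5))
```

In this sketch, elements away from the boundary are read three times, once by their own thread and once by each neighbor, so most loads fetch values that an adjacent thread already holds.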