Efficient Nearest-Neighbor Data Sharing in GPUs

30 December 2020

journal article
research article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Architecture and Code Optimization

Vol. 18 (1), 1-26
https://doi.org/10.1145/3429981

Abstract

Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its own value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches, and it often incurs performance overheads due to redundant memory accesses. In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) that can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism to eliminate nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. Cycle-accurate simulation indicates an average performance improvement of 21.8% and a power reduction of up to 18.3% for stencil codes in standard General-Purpose Graphics Processing Unit (GPGPU) benchmark suites. We show that NeDa’s performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange.
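
To make the nearest-neighbor access pattern described in the abstract concrete, the following is a minimal sketch of one sweep of a 5-point averaging stencil, written sequentially in Python. It is not taken from the article: the function name `stencil_5pt`, the averaging weights, and the 4x4 grid are assumptions chosen purely for illustration. On a GPU, each interior point would typically be handled by its own thread, so the four neighbor reads in the inner loop are the kind of sharing between nearest-neighbor threads that the abstract refers to.

```python
def stencil_5pt(grid):
    """One sweep of a 5-point averaging stencil over the interior points.

    Hypothetical illustration: each output point depends on its own value
    and on its four nearest neighbors (up, down, left, right).
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (
                grid[i][j]        # the point itself
                + grid[i - 1][j]  # neighbor above
                + grid[i + 1][j]  # neighbor below
                + grid[i][j - 1]  # neighbor to the left
                + grid[i][j + 1]  # neighbor to the right
            ) / 5.0
    return out


# A 4x4 grid with a single nonzero interior point.
grid = [[0.0] * 4 for _ in range(4)]
grid[1][1] = 5.0
result = stencil_5pt(grid)
```

In this sketch the value at `grid[1][1]` is read once for its own update and once more for each interior neighbor that lists it as an adjacent point. That repeated reading of the same input by adjacent points is the redundancy in memory accesses that the abstract mentions.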
