Highly Concurrent Latency-tolerant Register Files for GPUs

Abstract
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file to reduce register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate register working set within each interval. The key idea of LTRF is to prefetch the estimated register working set from the main register file to the register file cache under software control, at the beginning of each interval, and to overlap the prefetch latency with the execution of other warps. We observe that register bank conflicts during prefetching can greatly reduce the effectiveness of LTRF. Therefore, we devise a compile-time register renumbering technique to reduce the likelihood of register bank conflicts. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density, high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 34%.
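To make the compile-time side of the idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation): it splits a straight-line instruction stream into intervals, computes each interval's register working set, and renumbers registers so that a working set spreads across register banks (bank = register id mod number of banks), which is the property the renumbering technique aims for. The fixed interval length, the bank count, and the greedy round-robin heuristic are illustrative assumptions.

```python
from collections import namedtuple

# Each instruction lists the registers it reads or writes (illustrative IR).
Instruction = namedtuple("Instruction", ["opcode", "regs"])

NUM_BANKS = 4      # assumed number of register-file banks
INTERVAL_LEN = 4   # assumed fixed interval length; the paper derives intervals via interval analysis

def split_intervals(insts):
    """Divide the instruction stream into consecutive intervals."""
    return [insts[i:i + INTERVAL_LEN] for i in range(0, len(insts), INTERVAL_LEN)]

def working_set(interval):
    """All registers touched within one interval (its aggregate working set)."""
    return {r for inst in interval for r in inst.regs}

def renumber(intervals):
    """Greedy renumbering: assign new register ids round-robin across banks so
    that registers prefetched together in one interval land in different banks."""
    mapping = {}
    next_free = list(range(NUM_BANKS))  # next unused id per bank
    for interval in intervals:
        bank = 0
        for reg in sorted(working_set(interval)):
            if reg not in mapping:
                mapping[reg] = next_free[bank]
                next_free[bank] += NUM_BANKS
                bank = (bank + 1) % NUM_BANKS
    return mapping

# Example: registers 0, 4, 8, 12 would all map to bank 0 under id % 4,
# so prefetching them together would conflict. After renumbering they
# receive ids 0, 1, 2, 3 and spread across all four banks.
prog = [Instruction("add", [0, 4]), Instruction("mul", [8, 12])]
print(renumber(split_intervals(prog)))   # {0: 0, 4: 1, 8: 2, 12: 3}
```

In a real compiler pass, the interval boundaries would come from the interval analysis described in the abstract, and the prefetch of each working set would be emitted as an explicit software-controlled instruction at the interval entry; this sketch only illustrates the bank-spreading property that makes such prefetches conflict-free.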
