PAVER

Abstract
The massive parallelism available on GPUs comes at the cost of reduced L1 and L2 cache capacity per thread, leading to serious cache contention problems such as thrashing. Hence, the data-access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have exploited the locality behavior of regular, structured applications in thread scheduling, but the more difficult case of irregular, unstructured parallel applications remains to be explored. We present PAVER, a Priority-Aware Vertex schedulER, which takes a graph-theoretic approach to thread scheduling. We analyze the cache-locality behavior among thread blocks (TBs) through just-in-time compilation and encode the problem as a graph whose vertices are the TBs and whose edges capture the locality among them. This graph is then partitioned into TB groups that exhibit maximum data sharing, and each group is assigned to the same streaming multiprocessor by the locality-aware TB scheduler. Through exhaustive simulation of the Fermi, Pascal, and Volta architectures using a number of scheduling techniques, we show that PAVER reduces L2 accesses by 43.3%, 48.5%, and 40.21% and increases the average performance benefit by 29%, 49.1%, and 41.2% for the benchmarks with high inter-TB locality.
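
The following is a minimal sketch, not PAVER's actual implementation, of the idea outlined in the abstract: build a TB locality graph whose edge weights count shared data, partition it into groups with high internal sharing, and pin each group to one streaming multiprocessor (SM). The per-TB address-trace input format and the simple greedy partitioner used here are assumptions made purely for illustration.

# Illustrative sketch of locality-graph construction and grouping of thread
# blocks (TBs). The trace format and greedy partitioner are assumptions, not
# PAVER's published algorithm.
from collections import defaultdict
from itertools import combinations

def build_locality_graph(tb_addresses):
    """tb_addresses: dict mapping TB id -> set of cache-line addresses it touches.
    Returns edge weights: (tb_i, tb_j) -> number of shared cache lines."""
    weights = {}
    for a, b in combinations(sorted(tb_addresses), 2):
        shared = len(tb_addresses[a] & tb_addresses[b])
        if shared:
            weights[(a, b)] = shared
    return weights

def greedy_partition(tb_ids, weights, group_size):
    """Greedily grow groups of `group_size` TBs, always adding the TB with the
    highest total sharing with the current group (a stand-in for the graph
    partitioning step described in the abstract)."""
    unassigned = set(tb_ids)
    groups = []
    while unassigned:
        group = [unassigned.pop()]
        while len(group) < group_size and unassigned:
            best = max(unassigned, key=lambda t: sum(
                weights.get((min(t, g), max(t, g)), 0) for g in group))
            unassigned.remove(best)
            group.append(best)
        groups.append(group)
    return groups

# Example: 6 TBs, 2 TBs per SM; TBs 0/1, 2/3, and 4/5 share the most data,
# so the locality-aware grouping co-schedules each pair on one SM.
traces = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {10, 11}, 3: {11, 12},
          4: {20, 21}, 5: {21, 22}}
w = build_locality_graph(traces)
for sm, group in enumerate(greedy_partition(list(traces), w, group_size=2)):
    print(f"SM {sm}: TBs {group}")

In practice, the partitioning step would use a proper weighted graph partitioner rather than this greedy heuristic; the sketch only shows how inter-TB data sharing can drive the assignment of TB groups to SMs.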
Funding Information
  • NSF (CCF-1815643 and CCF-1907401)
