PAVER
Open Access
- 8 June 2021
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Architecture and Code Optimization
- Vol. 18 (3), 1-26
- https://doi.org/10.1145/3451164
Abstract
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have tried to use the locality behavior of regular and structured applications in thread scheduling, but the difficult case of irregular and unstructured parallel applications remains to be explored. We present PAVER , a P riority- A ware V ertex schedul ER , which takes a graph-theoretic approach toward thread scheduling. We analyze the cache locality behavior among thread blocks ( TBs ) through a just-in-time compilation, and represent the problem using a graph representing the TBs and the locality among them. This graph is then partitioned to TB groups that display maximum data sharing, which are then assigned to the same streaming multiprocessor by the locality-aware TB scheduler. Through exhaustive simulation in Fermi, Pascal, and Volta architectures using a number of scheduling techniques, we show that PAVER reduces L2 accesses by 43.3%, 48.5%, and 40.21% and increases the average performance benefit by 29%, 49.1%, and 41.2% for the benchmarks with high inter-TB locality.Keywords
Funding Information
- NSF (CCF-1815643 and CCF-1907401)
This publication has 42 references indexed in Scilit:
- LaPermACM SIGARCH Computer Architecture News, 2016
- Revealing Critical Loads and Hidden Data Locality in GPGPU ApplicationsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2015
- Improving GPGPU resource utilization through alternative thread block schedulingPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2014
- Locality-aware task management for unstructured parallelismPublished by Association for Computing Machinery (ACM) ,2013
- Cache-Conscious Wavefront SchedulingPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2012
- HPCTOOLKIT: tools for performance analysis of optimized parallel programsConcurrency and Computation: Practice and Experience, 2009
- Cache aware optimization of stream programsACM SIGPLAN Notices, 2005
- Practical experience of the limitations of gprofSoftware: Practice and Experience, 1993
- Shortest Connection Networks And Some GeneralizationsBell System Technical Journal, 1957
- On the shortest spanning subtree of a graph and the traveling salesman problemProceedings of the American Mathematical Society, 1956