Microarchitectural performance characterization of irregular GPU kernels

Abstract
GPUs are increasingly used to accelerate general-purpose applications, including applications with data-dependent, irregular memory access patterns and control flow. However, relatively little is known about the behavior of irregular GPU codes, and there has been minimal effort to quantify how they differ from regular GPGPU applications. We examine the behavior of a suite of optimized irregular CUDA applications on a cycle-accurate GPU simulator. We characterize the performance bottlenecks in each program and connect source code with microarchitectural characteristics. We also assess the impact of improvements in cache and DRAM bandwidth and latency and discuss the implications for GPU architecture design. We find that, while irregular graph codes exhibit significantly more underutilized execution cycles than regular programs due to branch divergence, load imbalance, and synchronization overhead, these factors contribute less to performance degradation than we expected. It appears that code optimizations are often able to address these performance hurdles effectively. Insufficient bandwidth and long memory latency are the biggest limiters of performance. Surprisingly, we find that applications with irregular memory access patterns are more sensitive to changes in L2 latency and bandwidth than to changes in DRAM latency and bandwidth.
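To make the notion of "irregular" concrete, the sketch below shows the kind of data-dependent kernel the abstract refers to: a CSR-style neighbor sum in which both the loop trip count and the gathered indices vary per thread. This is an illustrative example only, not a kernel from the paper's benchmark suite; the names (neighborSum, rowPtr, colIdx) are hypothetical.

```cuda
#include <cuda_runtime.h>

// Illustrative irregular kernel (assumed example, not from the studied suite):
// each thread processes one vertex of a CSR-format graph.
__global__ void neighborSum(const int* rowPtr, const int* colIdx,
                            const float* val, float* out, int numVerts)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVerts) return;

    float sum = 0.0f;
    // Vertex degree differs per thread, so warps diverge and work is imbalanced.
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; e++) {
        // colIdx[e] is data-dependent, producing scattered, poorly coalesced loads.
        sum += val[colIdx[e]];
    }
    out[v] = sum;
}
```

In such kernels the memory access pattern and control flow are only known at run time, which is why the characterization focuses on divergence, load imbalance, and memory-system sensitivity rather than on static instruction mix alone.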
