Accelerating CUDA graph algorithms at maximum warp

12 February 2011

conference paper
conference paper
Published by Association for Computing Machinery (ACM)

Vol. 46 (8), 267-276
https://doi.org/10.1145/1941553.1941590

Abstract

Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffered heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture.We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.

This publication has 14 references indexed in Scilit:

Language virtualization for heterogeneous parallel computing
Published by Association for Computing Machinery (ACM) ,2010
Debunking the 100X GPU vs. CPU myth
Published by Association for Computing Machinery (ACM) ,2010
The GPU Computing Era
IEEE Micro, 2010
Rodinia: A benchmark suite for heterogeneous computing
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2009
SNAP, Small-world Network Analysis and Partitioning: An open-source parallel graph framework for the exploration of large-scale networks
2008 IEEE International Symposium on Parallel and Distributed Processing, 2008
Graph Analysis with High-Performance Computing
Computing in Science & Engineering, 2008
Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2006
A Parallel External-Memory Frontier Breadth-First Traversal Algorithm for Clusters of Workstations
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2006
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
Published by Institute of Electrical and Electronics Engineers (IEEE) ,2005
R-MAT: A Recursive Model for Graph Mining
Published by Society for Industrial & Applied Mathematics (SIAM) ,2004

Cited by 197 articles