General transformations for GPU execution of tree traversals
- 17 November 2013
- conference paper
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Abstract
With the advent of programmer-friendly GPU computing environments, there has been much interest in offloading workloads that can exploit the high degree of parallelism available on modern GPUs. Exploiting this parallelism and optimizing for the GPU memory hierarchy is well-understood for regular applications that operate on dense data structures such as arrays and matrices. However, there has been significantly less work in the area of irregular algorithms and even less so when pointer-based dynamic data structures are involved. Recently, irregular algorithms such as Barnes-Hut and kd-tree traversals have been implemented on GPUs, yielding significant performance gains over CPU implementations. However, the implementations often rely on exploiting application-specific semantics to get acceptable performance. We argue that there are general-purpose techniques for implementing irregular algorithms on GPUs that exploit similarities in algorithmic structure rather than application-specific knowledge. We demonstrate these techniques on several tree traversal algorithms, achieving speedups of up to 38x over 32--thread CPU versions.Keywords
Funding Information
- Division of Computing and Communication Foundations (CCF-1150013)
- U.S. Department of Energy (DE-SC0010295)
This publication has 23 references indexed in Scilit:
- Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPUPublished by Association for Computing Machinery (ACM) ,2013
- Automatically enhancing locality for tree traversals with traversal splicingPublished by Association for Computing Machinery (ACM) ,2012
- A GPU implementation of inclusion-based points-to analysisPublished by Association for Computing Machinery (ACM) ,2012
- Enhancing locality for recursive traversals of recursive structuresPublished by Association for Computing Machinery (ACM) ,2011
- An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body AlgorithmPublished by Elsevier BV ,2011
- Cache-oblivious ray reorderingACM Transactions on Graphics, 2010
- Stackless KD‐Tree Traversal for High Performance GPU Ray TracingComputer Graphics Forum, 2007
- Fast Algorithms and Efficient Statistics: N-Point Correlation FunctionsPublished by Springer Science and Business Media LLC ,2000
- Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods: Barnes-Hut, Fast Multipole, and RadiosityJournal of Parallel and Distributed Computing, 1995
- Vectorization of a treecodeJournal of Computational Physics, 1990