Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
- 15 November 2012
- journal article
- Published by Springer Science and Business Media LLC in Synthesis Lectures on Computer Architecture
- Vol. 7 (2), 1-96
- https://doi.org/10.2200/s00451ed1v01y201209cac020
Abstract
General-purpose graphics processing units (GPGPU) have emerged as an important class of shared memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles that are used in previous shared memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to provide hints to archit...
This publication has 73 references indexed in Scilit:
- Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. Published by Institute of Electrical and Electronics Engineers (IEEE), 2012
- Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors. Published by Association for Computing Machinery (ACM), 2012
- Low depth cache-oblivious algorithms. Published by Association for Computing Machinery (ACM), 2010
- An adaptive performance modeling tool for GPU architectures. Published by Association for Computing Machinery (ACM), 2010
- Analyzing CUDA workloads using a detailed GPU simulator. Published by Institute of Electrical and Electronics Engineers (IEEE), 2009
- Fundamental parallel algorithms for private-cache chip multiprocessors. Published by Association for Computing Machinery (ACM), 2008
- The data locality of work stealing. Published by Association for Computing Machinery (ACM), 2000
- Programming parallel algorithms. Communications of the ACM, 1996
- The input/output complexity of sorting and related problems. Communications of the ACM, 1988
- Validity of the single processor approach to achieving large scale computing capabilities. Published by Association for Computing Machinery (ACM), 1967