Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

27 May 2010

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems

Vol. 22 (1), 105-118
https://doi.org/10.1109/tpds.2010.107

Abstract

The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.
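As a rough illustration of why a data-layout transformation can help vectorization, the sketch below compares the access strides seen when reading one field from an array-of-structures layout with those seen in a structure-of-arrays layout. It is not the transformation described in the paper. The four-field record, the helper names `aos_to_soa` and `strides_touched`, and the flat-index model of memory are all assumptions made for the example.

```python
# Illustrative sketch only; NOT the method from the paper.
# Assumes a hypothetical record of four float fields (x, y, z, w).
from typing import List, Tuple


def aos_to_soa(points: List[Tuple[float, float, float, float]]) -> List[List[float]]:
    """Convert an array of (x, y, z, w) records into four per-field arrays."""
    return [list(component) for component in zip(*points)]


def strides_touched(indices: List[int]) -> List[int]:
    """Return the gaps between consecutive element indices in an access trace."""
    return [b - a for a, b in zip(indices, indices[1:])]


# Hypothetical flat layouts for n records of `fields` elements each.
n, fields = 4, 4
aos_trace = [i * fields + 0 for i in range(n)]  # field x of each record, AoS layout
soa_trace = [0 * n + i for i in range(n)]       # field x of each record, SoA layout

print(strides_touched(aos_trace))  # gaps between accesses in the AoS layout
print(strides_touched(soa_trace))  # gaps between accesses in the SoA layout
```

In this toy model the array-of-structures trace skips over the unused fields of every record, while the structure-of-arrays trace reads adjacent elements, which is the kind of contiguous access that wide vector loads generally favor.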

Cited by 137 articles