Productive Programming of GPU Clusters with OmpSs
- 1 May 2012
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 15302075, pp. 557-568
- https://doi.org/10.1109/ipdps.2012.58
Abstract
Clusters of GPUs are emerging as a new computational scenario. Programming them requires hybrid models that increase the complexity of applications and reduce programmer productivity. We present an implementation of OmpSs for clusters of GPUs that supports asynchrony and heterogeneity for task parallelism. It is based on annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially on a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves data as needed between the different nodes and GPUs, minimizing the impact of communication through affinity scheduling, caching, and overlapping communication with computation. We show several applications programmed with OmpSs and their performance with multiple GPUs on a local node and on remote nodes. The results show a good tradeoff between performance and programmer effort.
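As a rough illustration of the directive-based style the abstract describes, the sketch below shows a tiled matrix multiply annotated in OmpSs-flavoured C. The clause spellings (in/inout vs. input/output), the copy_deps and device(cuda) modifiers, the matmul_tile task, and the BS tile size are assumptions for illustration, not code taken from the paper; building it would require the Mercurium compiler and Nanos++ runtime.

```c
// Hypothetical OmpSs sketch: a serial-looking tiled matrix multiply.
// The runtime derives a task graph from the data-access annotations
// and can schedule tasks on local or remote GPUs.

#define BS 512  /* assumed tile (block) size */

/* Declare a GPU task: device(cuda) names the target device kind and
 * copy_deps asks the runtime to move the annotated tiles to and from
 * the chosen GPU automatically. */
#pragma omp target device(cuda) copy_deps
#pragma omp task in([BS*BS]A, [BS*BS]B) inout([BS*BS]C)
void matmul_tile(const float *A, const float *B, float *C);

/* Driver loop stays sequential in the source; each call spawns an
 * asynchronous task, and the in/inout clauses let the runtime order
 * tasks, cache tiles, and overlap transfers with computation. */
void matmul(int nt, float **A, float **B, float **C)
{
    for (int i = 0; i < nt; i++)
        for (int j = 0; j < nt; j++)
            for (int k = 0; k < nt; k++)
                matmul_tile(A[i * nt + k], B[k * nt + j], C[i * nt + j]);
    #pragma omp taskwait  /* wait for all outstanding tasks */
}
```

Note how the same source carries no explicit data movement or node awareness: whether the tiles execute on one GPU or across a cluster is a runtime scheduling decision, which is the productivity argument the abstract makes.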