Productive Programming of GPU Clusters with OmpSs
- 1 May 2012
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- No. 15302075, pp. 557-568
- https://doi.org/10.1109/ipdps.2012.58
Abstract
Clusters of GPUs are emerging as a new computational scenario. Programming them requires hybrid models that increase the complexity of applications and reduce programmer productivity. We present an implementation of OmpSs for clusters of GPUs that supports asynchrony and heterogeneity for task parallelism. It is based on annotating a serial application with directives that are translated by the compiler. With it, the same program that runs sequentially on a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime system moves data as needed between the different nodes and GPUs, minimizing the impact of communication through affinity scheduling, caching, and overlapping communication with computation. We show several applications programmed with OmpSs and their performance with multiple GPUs on a local node and on remote nodes. The results show a good tradeoff between performance and programmer effort.
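As a rough illustration of the directive-based style the abstract describes, the sketch below shows a tiled matrix multiply annotated in OmpSs-flavoured C. The clause spellings (in/inout vs. input/output), the copy_deps and device(cuda) modifiers, the matmul_tile task, and the BS tile size are assumptions for illustration, not code taken from the paper; building it would require the Mercurium compiler and Nanos++ runtime.

```c
// Hypothetical OmpSs sketch: a serial-looking tiled matrix multiply.
// The runtime derives a task graph from the data-access annotations
// and can schedule tasks on local or remote GPUs.

#define BS 512  /* assumed tile (block) size */

/* Declare a GPU task: device(cuda) names the target device kind and
 * copy_deps asks the runtime to move the annotated tiles to and from
 * the chosen GPU automatically. */
#pragma omp target device(cuda) copy_deps
#pragma omp task in([BS*BS]A, [BS*BS]B) inout([BS*BS]C)
void matmul_tile(const float *A, const float *B, float *C);

/* Driver loop stays sequential in the source; each call spawns an
 * asynchronous task, and the in/inout clauses let the runtime order
 * tasks, cache tiles, and overlap transfers with computation. */
void matmul(int nt, float **A, float **B, float **C)
{
    for (int i = 0; i < nt; i++)
        for (int j = 0; j < nt; j++)
            for (int k = 0; k < nt; k++)
                matmul_tile(A[i * nt + k], B[k * nt + j], C[i * nt + j]);
    #pragma omp taskwait  /* wait for all outstanding tasks */
}
```

Note how the same source carries no explicit data movement or node awareness: whether the tiles execute on one GPU or across a cluster is a runtime scheduling decision, which is the productivity argument the abstract makes.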