XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures
- 1 May 2013
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing
- p. 1299-1308
- https://doi.org/10.1109/ipdps.2013.66
Abstract
Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators, like GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL codes; scheduling relies on a static partitioning and cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is two-fold. First, fine grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work stealing strategy that includes locality information; a very light implementation of the tasks in XKaapi; and an optimized search for ready tasks. Next, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi led to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double precision matrix product and 1.79 TFlop/s on Cholesky factorization; and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
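As an illustration of the data-flow task model the abstract refers to, the following is a minimal sketch of how a tiled Cholesky factorization can be expanded into tasks whose dependencies are inferred from the tiles each task reads and writes. It is not XKaapi's API: the function names `cholesky_tasks` and `build_dependencies`, the task labels, and the right-looking tile ordering are all assumptions chosen for illustration, and no numerical work or scheduling is performed.

```python
from collections import defaultdict


def cholesky_tasks(nt):
    """Yield (name, reads, writes) for a right-looking tiled Cholesky
    on an nt x nt grid of tiles (lower triangle only).
    Tiles in `writes` are updated in place (read-write access)."""
    for k in range(nt):
        yield (f"POTRF({k})", [], [(k, k)])
        for i in range(k + 1, nt):
            yield (f"TRSM({i},{k})", [(k, k)], [(i, k)])
        for i in range(k + 1, nt):
            yield (f"SYRK({i},{k})", [(i, k)], [(i, i)])
            for j in range(k + 1, i):
                yield (f"GEMM({i},{j},{k})", [(i, k), (j, k)], [(i, j)])


def build_dependencies(tasks):
    """Return {task: set of predecessor tasks}, derived from tile
    accesses taken in program order (hypothetical helper)."""
    last_writer = {}              # tile -> last task that wrote it
    readers = defaultdict(list)   # tile -> readers since the last write
    deps = {}
    for name, reads, writes in tasks:
        preds = set()
        for tile in reads:
            if tile in last_writer:
                preds.add(last_writer[tile])   # read-after-write
        for tile in writes:
            if tile in last_writer:
                preds.add(last_writer[tile])   # write-after-write
            preds.update(readers[tile])        # write-after-read
        for tile in reads:
            readers[tile].append(name)
        for tile in writes:
            last_writer[tile] = name
            readers[tile] = []
        preds.discard(name)
        deps[name] = preds
    return deps


if __name__ == "__main__":
    deps = build_dependencies(cholesky_tasks(3))
    for task, preds in deps.items():
        print(task, "<-", sorted(preds))
```

In this sketch a task becomes ready once every task in its predecessor set has completed; a runtime scheduler, such as the work stealing scheduler described in the abstract, would then decide where each ready task runs.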