XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing

p. 1299-1308
https://doi.org/10.1109/ipdps.2013.66

Abstract

Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators, such as GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL codes; scheduling relies on a static partitioning and cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is two-fold. First, fine-grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work stealing strategy that includes locality information; a very light implementation of the tasks in XKaapi; and an optimized search for ready tasks. Next, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi led to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double precision matrix product and 1.79 TFlop/s on Cholesky factorization; and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
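
The data-flow task model mentioned in the abstract can be illustrated with a small sketch. This is not XKaapi's API: the names `Task`, `build_dependencies`, and `dataflow_order` are hypothetical, the tile names are made up, and the code runs sequentially with no work stealing, locality information, or GPU execution. It only shows how declaring which data each task reads and writes is enough to derive the task graph and find ready tasks, here for three tiled GEMM updates.

```python
from collections import defaultdict, deque


class Task:
    """A task with declared data accesses (hypothetical, for illustration only)."""

    def __init__(self, name, reads=(), writes=()):
        self.name = name
        self.reads = tuple(reads)
        self.writes = tuple(writes)


def build_dependencies(tasks):
    """Derive dependencies from declared accesses, following program order.

    A reader depends on the last writer of a datum (read-after-write).
    A writer depends on the last writer and on every reader since that write.
    """
    last_writer = {}
    readers_since_write = defaultdict(list)
    deps = {t.name: set() for t in tasks}
    for t in tasks:
        for d in t.reads:
            if d in last_writer:
                deps[t.name].add(last_writer[d])
        for d in t.writes:
            if d in last_writer:
                deps[t.name].add(last_writer[d])
            deps[t.name].update(readers_since_write[d])
        # Record this task's accesses only after its own dependencies are known.
        for d in t.reads:
            readers_since_write[d].append(t.name)
        for d in t.writes:
            last_writer[d] = t.name
            readers_since_write[d] = []
    return deps


def dataflow_order(tasks):
    """Return one valid execution order: a task runs once all its inputs are ready."""
    deps = build_dependencies(tasks)
    pending = {name: len(d) for name, d in deps.items()}
    successors = defaultdict(list)
    for name, d in deps.items():
        for p in sorted(d):
            successors[p].append(name)
    ready = deque(t.name for t in tasks if pending[t.name] == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for s in successors[name]:
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)
    return order


# Three tile updates of C += A * B; g0 and g2 both update tile C00.
tasks = [
    Task("g0", reads=("A00", "B00", "C00"), writes=("C00",)),
    Task("g1", reads=("A00", "B01", "C01"), writes=("C01",)),
    Task("g2", reads=("A01", "B10", "C00"), writes=("C00",)),
]
deps = build_dependencies(tasks)
order = dataflow_order(tasks)
```

In this sketch `g0` and `g1` have no dependencies and could run concurrently, while `g2` must wait for `g0` because both update `C00`. A runtime of the kind the abstract describes would hand such ready tasks to idle CPU or GPU workers instead of a single queue.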
