A Comparative Survey of the HPC and Big Data Paradigms: Analysis and Experiments

Abstract
Many scientific data-analytic applications process huge volumes of input, often amounting to several terabytes of data. This underscores both the high I/O and the high computational costs of these algorithms. Individual tasks in such programs may be dominated by I/O operations or by computation. The underlying hardware is likewise heterogeneous, combining nodes with large storage capacity and nodes with powerful computational capabilities. To exploit this heterogeneity in both cloud and non-cloud environments, the issues of resource and job allocation need to be revisited. High-Performance Computing models, led by the MPI (plus OpenMP) parallel APIs, have largely met users' requirements for computational performance, while Big Data frameworks such as Spark have done likewise for high-level programming, resiliency, and I/O handling. Meeting the specialized needs of scientists therefore calls for a convergence of the HPC and Big Data ecosystems. This paper presents a data-supported, comparative survey of the main current HPC and Big Data programming interfaces, namely MPI, OpenMP, PGAS (OpenSHMEM), Spark, and Hadoop, and their software stacks. A comprehensive experimental study of these interfaces on a set of benchmarks, namely reduction and I/O microbenchmarks, the StackExchange AnswersCount benchmark, and a PageRank benchmark, has been performed on a single platform to ensure a fair comparison. These experiments ground a thorough discussion of whether the envisioned convergence is needed, whether it is efficient, and whether it is the best way to tackle future computational challenges.