Performance Prediction for Apache Spark Platform

Abstract
Apache Spark is an open-source distributed data processing platform that uses a distributed memory abstraction to process large volumes of data efficiently. However, the performance of a particular job on the Apache Spark platform can vary significantly depending on the input data type and size, the design and implementation of the algorithm, and the available computing capability, making it extremely difficult to predict job performance metrics such as execution time, memory footprint, and I/O cost. To address this challenge, in this paper, we present a simulation-driven prediction model that predicts job performance on the Apache Spark platform with high accuracy. Specifically, since Apache Spark jobs often consist of multiple sequential stages, the prediction model simulates the execution of the actual job using only a fraction of the input data and collects execution traces (e.g., I/O overhead, memory consumption, execution time) to predict job performance for each execution stage individually. We evaluated our prediction framework using four real-life applications on a 13-node cluster, and experimental results show that the model achieves high prediction accuracy.
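The abstract does not give the model's details, but the general per-stage sampling idea can be illustrated with a minimal Scala sketch: run the job on a small sample of the input, record each stage's wall-clock time through a SparkListener, and extrapolate to the full input size. The 5% sampling fraction, the stand-in word-count job, and the linear extrapolation are illustrative assumptions, not the authors' actual cost model.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of simulation-driven, per-stage performance prediction:
// execute the job on a sampled input, capture per-stage execution times,
// then extrapolate each stage's time to the full data size.
object StagePredictionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-prediction-sketch")
      .master("local[*]")
      .getOrCreate()

    val sampleFraction = 0.05 // assumed: simulate with 5% of the input

    // Record (stageId, wall-clock ms) for every completed stage.
    val stageTimes = ArrayBuffer.empty[(Int, Long)]
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onStageCompleted(sc: SparkListenerStageCompleted): Unit = {
        val info = sc.stageInfo
        for (start <- info.submissionTime; end <- info.completionTime)
          stageTimes += ((info.stageId, end - start))
      }
    })

    // A stand-in multi-stage job (word count): map stage, then shuffle/reduce stage.
    spark.sparkContext.textFile(args(0))
      .sample(withReplacement = false, sampleFraction)
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1L))
      .reduceByKey(_ + _)
      .count() // action triggers the staged execution

    spark.stop()

    // Naive linear extrapolation from the sample to the full input; the
    // paper's model presumably fits a richer per-stage cost function.
    stageTimes.foreach { case (id, ms) =>
      println(f"stage $id: sampled ${ms}ms -> ~${(ms / sampleFraction).toLong}ms predicted at full scale")
    }
  }
}
```

Predicting each stage separately, rather than the job as a whole, matters because stages differ in how they scale: a map-side stage may grow linearly with input size while a shuffle-heavy stage is dominated by network and I/O cost, so per-stage traces give the predictor a more faithful signal than a single end-to-end timing.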
