Performance Prediction for Apache Spark Platform
- 1 August 2015
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Apache Spark is an open-source distributed data processing platform that uses a distributed memory abstraction to process large volumes of data efficiently. However, the performance of a particular job on the Apache Spark platform can vary significantly depending on the input data type and size, the design and implementation of the algorithm, and the computing capability of the cluster. This makes it extremely difficult to predict performance metrics of a job such as execution time, memory footprint, and I/O cost. To address this challenge, we present a simulation-driven prediction model that can predict job performance with high accuracy on the Apache Spark platform. Specifically, because Apache Spark jobs often consist of multiple sequential stages, the presented prediction model simulates the execution of the actual job using only a fraction of the input data and collects execution traces (e.g., I/O overhead, memory consumption, execution time) to predict job performance for each execution stage individually. We evaluated our prediction framework using four real-life applications on a 13-node cluster, and the experimental results show that the model can achieve high prediction accuracy.
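The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea it describes: run the job on small fractions of the input, record a trace per stage, and extrapolate each stage separately before summing the sequential stages. The linear fit, the `StageTrace` record, the function names, and the trace values are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class StageTrace:
    """One stage's measurement from a run on a sample of the input (hypothetical record)."""
    stage: str
    sample_fraction: float  # fraction of the full input used, in (0, 1]
    seconds: float          # measured execution time of the stage


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x over (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        raise ValueError("need at least two distinct sample fractions per stage")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return mean_y - slope * mean_x, slope


def predict_job_seconds(traces, target_fraction=1.0):
    """Extrapolate each stage on its own, then sum the sequential stages."""
    by_stage = {}
    for t in traces:
        by_stage.setdefault(t.stage, []).append((t.sample_fraction, t.seconds))

    per_stage = {}
    for stage, points in by_stage.items():
        intercept, slope = fit_line(points)
        per_stage[stage] = intercept + slope * target_fraction
    return per_stage, sum(per_stage.values())


# Made-up traces from two sample runs (5% and 10% of the input); not real data.
traces = [
    StageTrace("map", 0.05, 3.0),
    StageTrace("map", 0.10, 5.0),
    StageTrace("reduce", 0.05, 2.0),
    StageTrace("reduce", 0.10, 2.5),
]
per_stage, total = predict_job_seconds(traces)
print(per_stage, total)
```

Fitting each stage independently mirrors the per-stage prediction the abstract mentions. A linear model is the simplest possible choice here; stages whose cost grows non-linearly with input size (for example, shuffle-heavy stages) would need a different fit, and the same structure could be applied to memory or I/O traces instead of execution time.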