A Novel Method for Tuning Configuration Parameters of Spark Based on Machine Learning

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)

p. 586-593
https://doi.org/10.1109/hpcc-smartcity-dss.2016.0088

Abstract

Apache Spark is an open-source distributed data processing platform that uses a distributed memory abstraction to process large volumes of data efficiently. As Apache Spark is applied more and more widely, some problems have been exposed, and performance is one of the most important. Apache Spark has more than 180 configuration parameters, which users can adjust for their specific application to optimize performance. Currently these parameters are tuned manually by trial and error, which is ineffective because of the large parameter space and the complex interactions among the parameters. To make parameter tuning for Spark more effective, this paper proposes a novel machine-learning-based method for tuning the Spark configuration, composed of binary classification and multi-class classification. The method can be used to auto-tune the configuration parameters of Spark. Several common machine learning algorithms are explored within the proposed method, and experimental results show that the decision tree model (C5.0) is the best model in terms of accuracy and computational efficiency. The experimental results also show that the proposed method yields an average performance gain of 36% compared with the default configuration of Spark.
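To give a sense of why manual trial and error scales poorly, the following is a minimal sketch, not the paper's method. It enumerates every combination of a few candidate values for four Spark properties and renders one combination as `spark-submit --conf` arguments. The property names are real Spark settings, but the candidate values and the helper names `enumerate_configs` and `to_submit_args` are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical candidate values for a handful of real Spark properties.
# The values are illustrative assumptions, not taken from the paper.
CANDIDATES = {
    "spark.executor.memory": ["1g", "2g", "4g"],
    "spark.executor.cores": ["1", "2", "4"],
    "spark.default.parallelism": ["8", "16", "32", "64"],
    "spark.shuffle.compress": ["true", "false"],
}


def enumerate_configs(candidates):
    """Return every combination of candidate values as a list of dicts."""
    keys = list(candidates)
    value_lists = [candidates[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


def to_submit_args(config):
    """Render one configuration as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(config.items()):
        args += ["--conf", f"{key}={value}"]
    return args


configs = enumerate_configs(CANDIDATES)
print(len(configs))  # 3 * 3 * 4 * 2 = 72 combinations from only four parameters
print(to_submit_args(configs[0]))
```

Even this small grid would require 72 benchmark runs per application; with more than 180 parameters an exhaustive search is out of reach, which is the motivation for learning a model that predicts good configurations instead.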

Cited by 54 articles