A Novel Method for Tuning Configuration Parameters of Spark Based on Machine Learning

Abstract
Apache Spark is an open source distributed data processing platform, which can use distributed memory abstraction to process large volume of data efficiently. With the application of Apache Spark more and more widely, some problems are exposed. One of the most important aspects is the performance problem. Apache Spark has more than 180 configuration parameters, which can be adjusted by users according to their own specific application so as to optimize the performance. Currently these parameters are tuned manually by trial and error, which is ineffective due to the large parameter space and the complex interactions among the parameters. In this paper, in order to make the parameter tuning process of Spark more effective, a novel method for tuning configuration of Spark based on machine learning is proposed, which is composed of binary classification and multi-classification. This method can be used to auto-tune the configuration parameters of Spark. Furthermore, several common machine learning algorithms based on the proposed method are explored, and experimental results show that decision tree model (C5.0) is the best model considering the accuracy and computational efficiency. Finally, the experimental results also show that the performance can get average 36% gain with the proposed method compared with the default configuration of Spark.

This publication has 7 references indexed in Scilit: