A Novel Method for Tuning Configuration Parameters of Spark Based on Machine Learning
- 1 December 2016
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
Abstract
Apache Spark is an open-source distributed data processing platform that uses a distributed memory abstraction to process large volumes of data efficiently. As Apache Spark is applied more and more widely, some problems have been exposed, and performance is one of the most important. Apache Spark has more than 180 configuration parameters, which users can adjust for their specific application to optimize performance. Currently these parameters are tuned manually by trial and error, which is ineffective because of the large parameter space and the complex interactions among the parameters. To make the parameter tuning process more effective, this paper proposes a novel machine-learning-based method for tuning the configuration of Spark, composed of binary classification and multi-classification. The method can be used to auto-tune the configuration parameters of Spark. Several common machine learning algorithms are explored on the basis of the proposed method, and experimental results show that the decision tree model (C5.0) is the best model in terms of accuracy and computational efficiency. The experimental results also show that the proposed method improves performance by 36% on average compared with the default configuration of Spark.
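The abstract does not say how the binary and multi-class labels are defined, so the following is only a minimal sketch of one plausible reading: a configuration is labelled 1 if it runs faster than the default (binary step), and configurations that beat the default are then bucketed into improvement grades (multi-class step). The parameter names are real Spark settings, but the helper names (`relative_gain`, `binary_label`, `grade_label`), the grade thresholds, the definition of gain, and all runtimes are assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One measured execution of a workload under a given configuration."""
    config: dict
    runtime_s: float


def relative_gain(default_s: float, tuned_s: float) -> float:
    """Fraction of runtime saved relative to the default configuration (assumed definition)."""
    return (default_s - tuned_s) / default_s


def binary_label(gain: float) -> int:
    """Assumed binary label: 1 if the configuration beats the default, else 0."""
    return 1 if gain > 0 else 0


def grade_label(gain: float, edges=(0.1, 0.3)) -> int:
    """Assumed multi-class label: number of thresholds in `edges` that the gain reaches."""
    return sum(gain >= e for e in edges)


# Invented measurements; only the parameter names correspond to real Spark settings.
default_runtime = 100.0
runs = [
    Run({"spark.executor.memory": "2g", "spark.default.parallelism": "8"}, 110.0),
    Run({"spark.executor.memory": "4g", "spark.default.parallelism": "8"}, 85.0),
    Run({"spark.executor.memory": "4g", "spark.default.parallelism": "64"}, 60.0),
]

# Build (config, binary label, grade) rows; the grade is only assigned
# to configurations that beat the default.
labelled = []
for run in runs:
    g = relative_gain(default_runtime, run.runtime_s)
    labelled.append((run.config, binary_label(g), grade_label(g) if g > 0 else None))

for config, better, grade in labelled:
    print(config, better, grade)
```

Rows labelled this way could serve as training data for a classifier such as a decision tree, which would then predict the label of an untried configuration without running it. Under the gain definition assumed here, a 36% gain would correspond to a runtime of 64 s against a 100 s default.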