Machine-Learning Based Memory Prediction Model for Data Parallel Workloads in Apache Spark

Open Access

16 April 2021

journal article
research article
Published by MDPI AG in Symmetry

Vol. 13 (4), 697
https://doi.org/10.3390/sym13040697

Abstract

Too little memory can cause job failures or lengthen processing time because of garbage collection. Too much memory, however, reduces processing time only marginally and leaves most of the allocation unused. Many big data processing tasks are executed in cloud environments, where the cost of renting virtual resources depends on their specifications (i.e., the number of virtual cores and the size of memory) as well as on the rental time. In this paper, given the type of workload and the volume of the input data, we analyze the memory usage pattern and derive the efficient memory size for data-parallel workloads in Apache Spark. We then propose a machine-learning-based prediction model that determines the efficient memory size for a given workload and input data. To validate the proposed model, we applied it to data-parallel workloads, including a deep learning model. The predicted memory values were in close agreement with the actual amount of memory required. Additionally, building the proposed model takes at most 44% of the total execution time of a data-parallel workload. The proposed model can improve memory efficiency by up to 1.89 times compared with the vanilla Spark setting.
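As a rough illustration of the idea in the abstract (profile a workload on small inputs, then predict the memory needed for a larger input), the sketch below fits a simple least-squares line from input volume to observed peak memory. The abstract does not say which learning algorithm or features the paper uses, so linear regression is only a stand-in here. The function names, the `headroom` factor, and all sample numbers are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_memory_gb(model, input_gb, headroom=1.1):
    """Predict memory for a given input volume, padded by a safety margin.

    `headroom` is an assumed margin, not a value taken from the paper.
    """
    slope, intercept = model
    return headroom * (slope * input_gb + intercept)


# Hypothetical profiling runs for one workload type (made-up numbers):
# input volume in GB and the peak memory observed for each run.
input_gb = [1.0, 2.0, 4.0, 8.0]
peak_memory_gb = [2.5, 4.0, 7.0, 13.0]

model = fit_line(input_gb, peak_memory_gb)
print(f"suggested memory for 16 GB of input: {predict_memory_gb(model, 16.0):.1f} GB")
```

In practice one model per workload type would be fitted, and the predicted value could then be passed to Spark through a setting such as `spark.executor.memory`.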

Funding Information

National Research Foundation of Korea (NRF-2018R1D1A1B07050377)
