Understanding the Influence of Configuration Settings: An Execution Model-Driven Framework for Apache Spark Platform

Abstract
Apache Spark provides numerous configuration settings that can be tuned to improve the performance of specific applications running on the platform. However, because of its multi-stage execution model and the complex interactions across nodes, it is nontrivial to understand how and why a specific setting influences the execution flow and performance. To address this challenge, we develop an execution model-driven framework that extracts key performance metrics relevant to different levels of execution (e.g., application level, stage level, task level, system level) and applies statistical analysis techniques to identify the key execution features that change significantly in response to changes in configuration settings. This allows users to answer questions such as "How does configuration setting X affect the execution behavior of Spark?" or "Why does changing configuration setting X degrade the performance of Spark application Y?" We tested our framework using six open-source applications (Word Count, Tera Sort, KMeans, Matrix Factorization, PageRank, and Triangle Count) and demonstrated its effectiveness in identifying the underlying reasons behind changes in performance.
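As a rough illustration of the idea of identifying execution features that change significantly between configurations, the following minimal sketch compares per-run metrics collected under two settings and ranks the features by Welch's t statistic. The abstract does not name the statistical techniques the framework uses, so the choice of Welch's t, the threshold of 2.0, the metric names, and the sample values are all assumptions made for illustration only.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    diff = mean(b) - mean(a)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Both samples are constant: no change, or an unbounded one.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def changed_features(baseline, modified, threshold=2.0):
    """Return (feature, t) pairs with |t| >= threshold, largest |t| first.

    `threshold` is an arbitrary illustrative cutoff, not a calibrated
    significance level.
    """
    flagged = []
    for name in baseline.keys() & modified.keys():
        t = welch_t(baseline[name], modified[name])
        if abs(t) >= threshold:
            flagged.append((name, t))
    return sorted(flagged, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical metrics from five runs under each configuration setting.
baseline = {
    "app_runtime_s": [120, 118, 123, 121, 119],
    "task_gc_time_ms": [300, 310, 295, 305, 298],
    "stage_shuffle_read_mb": [512, 515, 510, 514, 511],
}
modified = {
    "app_runtime_s": [150, 148, 153, 151, 149],
    "task_gc_time_ms": [900, 920, 880, 910, 905],
    "stage_shuffle_read_mb": [513, 511, 514, 510, 512],
}

for feature, t in changed_features(baseline, modified):
    print(f"{feature}: t = {t:.2f}")
```

In this made-up data, the task-level garbage collection time and the application runtime are flagged while the stage-level shuffle read volume is not, which is the kind of evidence that could point a user toward memory pressure as the reason a setting degrades performance.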
