A self-tuning system based on application Profiling and Performance Analysis for optimizing Hadoop MapReduce cluster configuration

1 December 2013

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 89-98
https://doi.org/10.1109/hipc.2013.6799133

Abstract

One of the most widely used frameworks for programming MapReduce-based applications is Apache Hadoop. Despite its popularity, application developers face numerous challenges in using the Hadoop framework. These stem from having to manage the resources of a MapReduce cluster effectively and to configure the framework in a way that optimizes the performance and reliability of the MapReduce applications running on it. This paper addresses these problems by presenting the Profiling and Performance Analysis-based System (PPABS) framework, which automates the tuning of Hadoop configuration settings based on deduced application performance requirements. The PPABS framework comprises two distinct phases: the Analyzer, which trains PPABS to form a set of equivalence classes of MapReduce applications and determines, for each class, the Hadoop configuration parameters that maximally improve performance; and the Recognizer, which classifies an incoming unknown job into one of these equivalence classes so that its Hadoop configuration parameters can be self-tuned. The key research contributions in the Analyzer phase include modifications to the well-known k-means++ clustering and Simulated Annealing algorithms, which were required to adapt them to the MapReduce paradigm. The key contributions in the Recognizer phase include an approach to classify an unknown, incoming job into one of the equivalence classes and a control strategy to self-tune the Hadoop cluster configuration parameters for that job. Experiments on three different classes of applications running on Hadoop clusters deployed on Amazon EC2 show promising performance improvements.
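
The two-phase idea in the abstract can be illustrated with a minimal sketch. It uses the textbook forms of k-means++ and simulated annealing, not the modified versions the paper contributes, which the abstract does not describe. The nearest-centroid rule standing in for the Recognizer, the two-feature job profiles (CPU fraction, I/O fraction), and the helpers `toy_cost` and `toy_neighbor` are all assumptions made for illustration; in a real setting the cost would be a measured job runtime and the tuple would hold actual Hadoop parameters.

```python
import math
import random


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the closest center (assumed stand-in for the Recognizer)."""
    return min(range(len(centers)), key=lambda i: dist2(point, centers[i]))


def kmeanspp_seeds(points, k, rng):
    """Textbook k-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center.
    Requires at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(dist2(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, rng, iters=20):
    """Lloyd iterations started from k-means++ seeds."""
    centers = kmeanspp_seeds(points, k, rng)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Recompute each center as its group's mean; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def anneal(cost, start, neighbor, rng, steps=2000, t0=1.0, cooling=0.995):
    """Textbook simulated annealing with geometric cooling (minimizes cost)."""
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_cost = cost(cand)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with prob exp(-delta/t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost


def toy_cost(cfg):
    """Hypothetical runtime model over (task_slots, buffer_mb); not measured data."""
    slots, buf_mb = cfg
    return (slots - 8) ** 2 + ((buf_mb - 200) / 10) ** 2


def toy_neighbor(cfg, rng):
    """Perturb one hypothetical parameter by a single step, keeping it >= 1."""
    i = rng.randrange(len(cfg))
    step = rng.choice([-1, 1]) * (1 if i == 0 else 10)
    new = list(cfg)
    new[i] = max(1, new[i] + step)
    return tuple(new)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical job profiles: (cpu_fraction, io_fraction).
    profiles = [(0.90, 0.10), (0.85, 0.15), (0.88, 0.12),
                (0.10, 0.90), (0.15, 0.85), (0.12, 0.90),
                (0.50, 0.50), (0.52, 0.48), (0.48, 0.50)]
    centers = kmeans(profiles, 3, rng)           # Analyzer: form classes
    tuned = [anneal(toy_cost, (2, 100), toy_neighbor, rng)[0] for _ in centers]
    cls = nearest((0.87, 0.13), centers)         # Recognizer: classify new job
    print("class", cls, "-> config", tuned[cls])
```

In this sketch every class is tuned against the same toy cost, so the per-class configurations differ only by random search; a per-class cost function would be needed for the classes to yield meaningfully different settings.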

