A self-tuning system based on application Profiling and Performance Analysis for optimizing Hadoop MapReduce cluster configuration
- 1 December 2013
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
One of the most widely used frameworks for programming MapReduce-based applications is Apache Hadoop. Despite its popularity, application developers face numerous challenges in using the Hadoop framework. These stem from having to manage the resources of a MapReduce cluster effectively and to configure the framework in a way that optimizes the performance and reliability of the MapReduce applications running on it. This paper addresses these problems by presenting the Profiling and Performance Analysis-based System (PPABS) framework, which automates the tuning of Hadoop configuration settings based on deduced application performance requirements. The PPABS framework comprises two distinct phases. The Analyzer trains PPABS to form a set of equivalence classes of MapReduce applications and determines, for each class, the Hadoop configuration parameters that maximally improve performance for that class. The Recognizer classifies an incoming unknown job into one of these equivalence classes so that its Hadoop configuration parameters can be self-tuned. The key research contributions in the Analyzer phase include modifications to the well-known k-means++ clustering and Simulated Annealing algorithms, which were required to adapt them to the MapReduce paradigm. The key contributions in the Recognizer phase include an approach to classify an unknown, incoming job into one of the equivalence classes and a control strategy to self-tune the Hadoop cluster configuration parameters for that job. Experiments with three different classes of applications running on Hadoop clusters deployed on Amazon EC2 show promising performance improvements.
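The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It uses the textbook k-means++ seeding and a textbook simulated annealing loop, not the modified variants the paper contributes, which the abstract does not describe. The two-dimensional job profile vectors, the single numeric setting being tuned, and the cost function are all hypothetical placeholders chosen for illustration.

```python
import math
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(group):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(group)
    return tuple(sum(col) / n for col in zip(*group))


def nearest(point, centers):
    """Index of the center closest to point (the Recognizer-style step)."""
    return min(range(len(centers)), key=lambda i: dist2(point, centers[i]))


def kmeanspp_seeds(points, k, rng):
    """Textbook k-means++ seeding: the first center is picked uniformly,
    each later one with probability proportional to its squared distance
    from the nearest center chosen so far. Assumes k distinct points exist."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(dist2(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, rng, iters=20):
    """Lloyd iterations started from k-means++ seeds (Analyzer-style step)."""
    centers = kmeanspp_seeds(points, k, rng)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def anneal(cost, start, neighbor, rng, temp=1.0, cooling=0.95, steps=200):
    """Textbook simulated annealing: always accept improvements, accept
    worse candidates with probability exp(-delta / temp), cool each step."""
    current = best = start
    for _ in range(steps):
        cand = neighbor(current, rng)
        delta = cost(cand) - cost(current)
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current = cand
        if cost(current) < cost(best):
            best = current
        temp *= cooling
    return best


rng = random.Random(0)

# Hypothetical job profiles: two well-separated groups of feature vectors.
profiles = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
            (100.0, 100.0), (101.0, 100.0), (100.0, 101.0)]
centers = kmeans(profiles, 2, rng)

# Classify an unseen job by its nearest equivalence-class center.
new_job_class = nearest((99.0, 99.5), centers)


# Hypothetical cost: pretend run time is minimized when the setting is 3.0.
def toy_cost(x):
    return (x - 3.0) ** 2


def toy_neighbor(x, r):
    return x + r.uniform(-1.0, 1.0)


tuned = anneal(toy_cost, 0.0, toy_neighbor, rng)
print(new_job_class, round(tuned, 2))
```

In a real deployment the cost function would be a measured or modelled job run time under a candidate configuration, and each equivalence class would store its own tuned configuration for the Recognizer to apply.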