Workload characterization on a production Hadoop cluster: A case study on Taobao

1 November 2012

conference paper
conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 3-13
https://doi.org/10.1109/iiswc.2012.6402895

Abstract

MapReduce is becoming the state-of-the-art computing paradigm for processing large-scale datasets on a large cluster with tens or thousands of nodes. It has been widely used in various fields such as e-commerce, Web search, social networks, and scientific computation. Understanding the characteristics of MapReduce workloads is the key to achieving better configuration decisions and improving the system throughput. However, workload characterization of MapReduce, especially in a large-scale production environment, has not been well studied yet. To gain insight on MapReduce workloads, we collected a two-week workload trace from a 2,000-node Hadoop cluster at Taobao, which is the biggest online e-commerce enterprise in Asia, ranked 14 ^th in the world as reported by Alexa. The workload trace covered 912,157 jobs, logged from Dec. 4 to Dec. 20, 2011. We characterized the workload at the granularity of job and task, respectively and concluded with a set of interesting observations. The results of workload characterization are representative and generally consistent with data platforms for e-commerce websites, which can help other researchers and engineers understand the performance and job characteristics of Hadoop in their production environments. In addition, we use these job analysis statistics to derive several implications for potential performance optimization solutions.

Keywords

This publication has 24 references indexed in Scilit:

Delay scheduling
Published by Association for Computing Machinery (ACM) ,2010
Towards characterizing cloud backend workloads
ACM SIGMETRICS Performance Evaluation Review, 2010
Web server performance analysis using histogram workload models
Computer Networks, 2009
Workload characterization in a high-energy data grid and impact on resource management
Cluster Computing, 2009
An analysis of data corruption in the storage stack
ACM Transactions on Storage, 2008
Statistical Analysis and Modeling of Jobs in a Grid Environment
Journal of Grid Computing, 2007
Workload Analysis of a Cluster in a Grid Environment
Lecture Notes in Computer Science, 2005
A Scalable Distributed Shared Memory Architecture
Journal of Parallel and Distributed Computing, 1994
Empirically derived analytic models of wide-area TCP connections
IEEE/ACM Transactions on Networking, 1994
A Proof for the Queuing Formula: L = λW
Operations Research, 1961

Cited by 71 articles