Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study

1 November 2014

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 167-177
https://doi.org/10.1109/issre.2014.34

Abstract

In this paper, we analyze a workload trace from the Google cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We also explore the potential for early failure prediction and anomaly detection for jobs. Based on our results, we speculate that there are many opportunities to enhance the reliability of applications running in the cloud, such as proactive maintenance of nodes or limiting job resubmissions. We further find that the resource usage patterns of jobs can be leveraged by failure prediction techniques. Finally, we find that the termination statuses of jobs and tasks can be clustered into six dominant categories based on user profiles.

Cited by 60 articles