An Empirical Failure-Analysis of a Large-Scale Cloud Computing Environment

Abstract

Cloud computing research is in great need of statistical parameters derived from the analysis of real-world systems. One such area is the failure characteristics of Cloud environments composed of workloads and servers; currently, few metrics are available that quantify failure and repair times of workloads and servers at large scale. Workload metrics in particular are critical for characterizing and modeling workload behavior accurately, enabling more realistic simulation of workloads and of system failure scenarios. This paper presents an analysis of failure data from a large-scale production Cloud environment (consisting of over 12,500 servers), including a study of failure and repair times and characteristics for both Cloud workloads and servers. Our results show that failure characteristics for workloads and servers are highly variable and that production Cloud workloads can be accurately modeled by a Gamma distribution. Repair times range from 30 seconds to 4 days for workloads and from 25 minutes to 8 days for servers.
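
As a rough illustration of what modeling failure times with a Gamma distribution can look like in practice, the sketch below fits a Gamma shape and scale to a set of durations using the method of moments. It is not the paper's procedure: the estimator, the helper name fit_gamma_moments, and the synthetic sample (drawn with an assumed shape of 2.0 and scale of 3.0 hours) are all assumptions made for the example, and no data from the study is used.

```python
import random
import statistics


def fit_gamma_moments(samples):
    """Method-of-moments estimate of Gamma (shape, scale).

    For a Gamma distribution, mean = shape * scale and
    variance = shape * scale**2, so
    shape = mean**2 / variance and scale = variance / mean.
    """
    mean = statistics.fmean(samples)
    var = statistics.variance(samples)  # unbiased sample variance
    return mean * mean / var, var / mean


# Synthetic durations in hours (hypothetical values, NOT data from the paper).
rng = random.Random(0)
true_shape, true_scale = 2.0, 3.0
times = [rng.gammavariate(true_shape, true_scale) for _ in range(20000)]

shape, scale = fit_gamma_moments(times)
print(f"estimated shape={shape:.2f}, scale={scale:.2f}")
```

With real trace data, the list times would be replaced by measured failure or repair durations, and a maximum-likelihood fit followed by a goodness-of-fit test would usually be preferred over this simple estimator.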
