A Practical Approach to Hard Disk Failure Prediction in Cloud Platforms: Big Data Model for Failure Management in Datacenters

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService)

p. 105-116
https://doi.org/10.1109/bigdataservice.2016.10

Abstract

Large-scale cloud platforms can benefit from a service that runs a machine learning model to predict disk drive failures. Unlike previous studies in this space, we combined multiple data inputs for the model and obtained better model performance than earlier published models. In this paper, we explain how we developed and deployed the predictive model in a large-scale cloud service. To build the model, we used a combination of two open data sources: Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T. or SMART) data and Windows performance counters. The two data sources differ in nature, and both are complex. The paper provides unique ways of parsing and transforming the data to make it well suited to a classification problem. Trials with different machine learning (ML) and statistical modeling techniques led us to the best-performing model, a two-stage ensemble. We implemented this model to be configurable so that it could be deployed on large-scale distributed cloud management systems and iterated on with minimal code impact. We provide a glimpse of the complex cloud hardware ecosystem and how a predictive model would impact such an ecosystem. Although our study focused on hard disk drives, we believe a similar modeling approach can apply to other hardware components as well. A successfully executed hard disk failure prediction model can pre-empt negative impact on client workloads and improve the economics of running a large-scale cloud service. We provide the details of our model as a possible template for future extensions and improvements towards building more robust hardware fault prediction services. Finally, we give a staged approach to operationalizing the model in large-scale cloud systems.
