A Practical Approach to Hard Disk Failure Prediction in Cloud Platforms: Big Data Model for Failure Management in Datacenters

Abstract
Large-scale cloud platforms can benefit from a service that runs a machine learning model to predict disk drive failures. Unlike previous studies in this space, we combined multiple data inputs for the model and obtained better performance than earlier published models. In this paper, we explain how we developed and deployed the predictive model in a large-scale cloud service. To build the model, we used a combination of two open data sources: Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T. or SMART) data and Windows performance counters. The two data sources differ in nature, and both are complex. The paper provides unique ways of parsing and transforming the data to make it best suited to a classification problem. Trials with different machine learning (ML) and statistical modeling techniques led us to the best-performing approach, a two-stage ensemble model. We implemented this model to be configurable so that it can be deployed on large-scale distributed cloud management systems and iterated on with minimal code impact. We provide a glimpse of the complex cloud hardware ecosystem and how a predictive model would impact such an ecosystem. Although our study focused on hard disk drives, we believe a similar modeling approach can apply to other hardware components as well. A successfully executed hard disk failure prediction model can pre-empt negative impact on client workloads and improve the economics of running a large-scale cloud service. We provide the details of our model as a possible template for future extensions and improvements towards building more robust hardware fault prediction services. Finally, we give a staged approach to operationalizing the model in large-scale cloud systems.
