Understanding latent sector errors and how to protect against them

28 September 2010

journal article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Storage

Vol. 6 (3), 1-23
https://doi.org/10.1145/1837915.1837917

Abstract

Latent sector errors (LSEs) refer to the situation where particular sectors on a drive become inaccessible. LSEs are a critical factor in data reliability, since a single LSE can lead to data loss when encountered during RAID reconstruction after a disk failure or in systems without redundancy. LSEs happen at a significant rate in the field [Bairavasundaram et al. 2007], and are expected to grow more frequent with new drive technologies and increasing drive capacities. While two approaches, data scrubbing and intra-disk redundancy, have been proposed to reduce data loss due to LSEs, neither of these approaches has been evaluated on real field data. This article makes two contributions. We provide an extended statistical analysis of latent sector errors in the field, specifically from the viewpoint of how to protect against LSEs. In addition to providing interesting insights into LSEs, we hope the results (including parameters for models we fit to the data) will help researchers and practitioners without access to data in driving their simulations or analysis of LSEs. Our second contribution is an evaluation of five different scrubbing policies and five different intra-disk redundancy schemes and their potential in protecting against LSEs. Our study includes schemes and policies that have been suggested before, but have never been evaluated on field data, as well as new policies that we propose based on our analysis of LSEs in the field.

This publication has 13 references indexed in Scilit:

The RAID-6 Liber8Tion Code
The International Journal of High Performance Computing Applications, 2009
Hard-disk drives
Communications of the ACM, 2009
Disk scrubbing versus intra-disk redundancy for high-reliability raid storage systems
Published by Association for Computing Machinery (ACM), 2008
A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors
ACM Transactions on Storage, 2008
Improving file system reliability with I/O shepherding
Published by Association for Computing Machinery (ACM), 2007
An analysis of latent sector errors in disk drives
Published by Association for Computing Machinery (ACM), 2007
HoVer Erasure Codes For Disk Arrays
Published by Institute of Electrical and Electronics Engineers (IEEE), 2006
IRON file systems
Published by Association for Computing Machinery (ACM), 2005
A case for redundant arrays of inexpensive disks (RAID)
ACM SIGMOD Record, 1988
A fast file system for UNIX
ACM Transactions on Computer Systems, 1984

Cited by 82 articles