Understanding latent sector errors and how to protect against them
- 28 September 2010
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Storage
- Vol. 6 (3), 1-23
- https://doi.org/10.1145/1837915.1837917
Abstract
Latent sector errors (LSEs) refer to the situation where particular sectors on a drive become inaccessible. LSEs are a critical factor in data reliability, since a single LSE can lead to data loss when encountered during RAID reconstruction after a disk failure or in systems without redundancy. LSEs happen at a significant rate in the field [Bairavasundaram et al. 2007], and are expected to grow more frequent with new drive technologies and increasing drive capacities. While two approaches, data scrubbing and intra-disk redundancy, have been proposed to reduce data loss due to LSEs, neither of these approaches has been evaluated on real field data. This article makes two contributions. We provide an extended statistical analysis of latent sector errors in the field, specifically from the viewpoint of how to protect against LSEs. In addition to providing interesting insights into LSEs, we hope the results (including parameters for models we fit to the data) will help researchers and practitioners without access to data in driving their simulations or analysis of LSEs. Our second contribution is an evaluation of five different scrubbing policies and five different intra-disk redundancy schemes and their potential in protecting against LSEs. Our study includes schemes and policies that have been suggested before, but have never been evaluated on field data, as well as new policies that we propose based on our analysis of LSEs in the field.
This publication has 13 references indexed in Scilit:
- The Raid-6 Liber8Tion Code. The International Journal of High Performance Computing Applications, 2009
- Hard-disk drives. Communications of the ACM, 2009
- Disk scrubbing versus intra-disk redundancy for high-reliability raid storage systems. Published by Association for Computing Machinery (ACM), 2008
- A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors. ACM Transactions on Storage, 2008
- Improving file system reliability with I/O shepherding. Published by Association for Computing Machinery (ACM), 2007
- An analysis of latent sector errors in disk drives. Published by Association for Computing Machinery (ACM), 2007
- HoVer Erasure Codes For Disk Arrays. Published by Institute of Electrical and Electronics Engineers (IEEE), 2006
- IRON file systems. Published by Association for Computing Machinery (ACM), 2005
- A case for redundant arrays of inexpensive disks (RAID). ACM SIGMOD Record, 1988
- A fast file system for UNIX. ACM Transactions on Computer Systems, 1984