Extreme Binning: Scalable, parallel deduplication for chunk-based file backup
Top Cited Papers
- 1 September 2009
- conference paper
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Data deduplication is an essential and critical component of backup systems. Essential, because it reduces storage space requirements, and critical, because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, which existing deduplication techniques require to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for non-traditional backup workloads that are made up of individual files with no locality among consecutive files in a given window of time. Due to lack of locality, existing techniques perform poorly on these workloads. Extreme Binning exploits file similarity instead of locality, and makes only one disk access for chunk lookup per file, which gives reasonable throughput. Multi-node backup systems built with Extreme Binning scale gracefully with the amount of input data; more backup nodes can be added to boost throughput. Each file is allocated using a stateless routing algorithm to only one node, allowing for maximum parallelization, and each backup node is autonomous with no dependency across nodes, making data management tasks robust with low overhead.Keywords
This publication has 17 references indexed in Scilit:
- The design of a similarity based deduplication systemPublished by Association for Computing Machinery (ACM) ,2009
- PAST: a large-scale, persistent peer-to-peer storage utilityPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2005
- Tapestry: A Resilient Global-Scale Overlay for Service DeploymentIEEE Journal on Selected Areas in Communications, 2004
- On the resemblance and containment of documentsPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- Compactly encoding unstructured inputs with differential compressionJournal of the ACM, 2002
- PastichePublished by Association for Computing Machinery (ACM) ,2002
- Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer SystemsLecture Notes in Computer Science, 2001
- ChordPublished by Association for Computing Machinery (ACM) ,2001
- OceanStorePublished by Association for Computing Machinery (ACM) ,2000
- The MD5 Message-Digest Algorithm1992