Multi-level comparison of data deduplication in a backup scenario
- 4 May 2009
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
Data deduplication systems detect redundancies between data blocks to reduce either storage needs or network traffic. One class of deduplication systems splits the data stream into data blocks (chunks) and then finds exact duplicates of these blocks. This paper compares the influence of different chunking approaches on multiple levels. On a macroscopic level, we compare the chunking approaches based on real-life user data in a weekly full backup scenario, both at a single point in time and over several weeks. In addition, on a microscopic level, we analyze how small changes affect the deduplication ratio for different file types, for both chunking approaches and delta encoding. An intuitive assumption is that small semantic changes to documents cause only small modifications in the binary representation of files, which would imply a high deduplication ratio. We show that this assumption does not hold for many important file types and that application-specific chunking can help to further reduce storage capacity demands.
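To illustrate the class of systems the abstract describes, the following is a minimal sketch of chunk-based deduplication using fixed-size chunks and hashing to detect exact duplicates. This is an illustrative assumption, not the paper's method: the paper compares several chunking approaches, and real systems often use content-defined chunking rather than fixed offsets.

```python
import hashlib

def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Split data into fixed-size chunks and report the deduplication
    ratio: total chunks / unique chunks (1.0 means no redundancy)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Identify exact duplicates by their cryptographic fingerprint.
    unique = {hashlib.sha256(c).digest() for c in chunks}
    return len(chunks) / len(unique)

# A 4 KiB block repeated eight times deduplicates to one stored chunk.
block = bytes(range(256)) * 16   # exactly 4096 bytes
print(dedup_ratio(block * 8))    # → 8.0
```

Fixed-size chunking is fragile under insertions, since a single inserted byte shifts every subsequent chunk boundary; this boundary-shift problem is one reason the paper's comparison of chunking approaches matters.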