Multi-level comparison of data deduplication in a backup scenario
- 4 May 2009
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
Data deduplication systems detect redundancies between data blocks to reduce either storage needs or network traffic. One class of deduplication systems splits the data stream into data blocks (chunks) and then finds exact duplicates of these blocks. This paper compares the influence of different chunking approaches on multiple levels. On a macroscopic level, we compare the chunking approaches based on real-life user data in a weekly full backup scenario, both at a single point in time and over several weeks. In addition, on a microscopic level, we analyze how small changes affect the deduplication ratio for different file types, for both chunking approaches and delta encoding. An intuitive assumption is that small semantic changes to documents cause only small modifications in the binary representation of files, which would imply a high deduplication ratio. We show that this assumption does not hold for many important file types and that application-specific chunking can help to further reduce storage capacity demands.
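To illustrate the class of systems the abstract describes, the following is a minimal sketch of chunk-based deduplication using fixed-size chunks and hashing to detect exact duplicates. This is an illustrative assumption, not the paper's method: the paper compares several chunking approaches, and real systems often use content-defined chunking rather than fixed offsets.

```python
import hashlib

def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Split data into fixed-size chunks and report the deduplication
    ratio: total chunks / unique chunks (1.0 means no redundancy)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Identify exact duplicates by their cryptographic fingerprint.
    unique = {hashlib.sha256(c).digest() for c in chunks}
    return len(chunks) / len(unique)

# A 4 KiB block repeated eight times deduplicates to one stored chunk.
block = bytes(range(256)) * 16   # exactly 4096 bytes
print(dedup_ratio(block * 8))    # → 8.0
```

Fixed-size chunking is fragile under insertions, since a single inserted byte shifts every subsequent chunk boundary; this boundary-shift problem is one reason the paper's comparison of chunking approaches matters.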