Quantifying molecular bias in DNA data storage
Open Access
- 29 June 2020
- journal article
- research article
- Published by Springer Science and Business Media LLC in Nature Communications
- Vol. 11 (1), 1-9
- https://doi.org/10.1038/s41467-020-16958-3
Abstract
DNA has recently emerged as an attractive medium for archival data storage. Recent work has demonstrated proof-of-principle prototype systems; however, very uneven (biased) sequencing coverage has been reported, which indicates inefficiencies in the storage process. Deviations from the average coverage in the sequence copy distribution can cause either wasteful provisioning in sequencing or an excessive number of missing sequences. Here, we use millions of unique sequences from a DNA-based digital data archival system to study the oligonucleotide copy unevenness problem and show that the two paramount sources of bias are the synthesis and amplification (PCR) processes. Based on these findings, we develop a statistical model for each molecular process as well as the overall process. We further use our model to explore the trade-offs between synthesis bias, storage physical density, logical redundancy, and sequencing redundancy, providing insights for engineering efficient, robust DNA data storage systems.
Funding Information
- United States Department of Defense | Defense Advanced Research Projects Agency
- Microsoft
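The trade-off named in the abstract, between sequencing redundancy and the number of missing sequences when copy counts are uneven, can be illustrated with a small simulation. The sketch below is not the authors' model. It assumes a log-normal spread of copies after synthesis, treats PCR as a simple branching process in which each molecule duplicates with a fixed probability per cycle, and treats sequencing as random sampling of reads in proportion to copy number. The function name `simulate_dropout` and all parameter values are hypothetical.

```python
import math
import random


def simulate_dropout(n_seqs, sigma, coverage, mean_copies=10,
                     pcr_cycles=5, pcr_eff=0.9, seed=0):
    """Return the fraction of sequences that receive zero reads.

    sigma    : log-scale spread of synthesis copy numbers (assumed log-normal)
    coverage : average number of reads per sequence (sequencing redundancy)
    """
    rng = random.Random(seed)

    # Synthesis: draw an initial copy number per sequence. mu is chosen so
    # the mean of the log-normal stays at mean_copies regardless of sigma.
    mu = math.log(mean_copies) - sigma ** 2 / 2
    copies = [round(rng.lognormvariate(mu, sigma)) for _ in range(n_seqs)]

    # PCR: in every cycle each molecule is duplicated with probability
    # pcr_eff (a basic Galton-Watson branching process).
    for _ in range(pcr_cycles):
        for i, c in enumerate(copies):
            copies[i] = c + sum(1 for _ in range(c) if rng.random() < pcr_eff)

    # Sequencing: sample reads with replacement, weighted by copy number.
    n_reads = coverage * n_seqs
    reads = rng.choices(range(n_seqs), weights=copies, k=n_reads)

    # A sequence is "missing" if no read landed on it.
    return (n_seqs - len(set(reads))) / n_seqs


if __name__ == "__main__":
    for sigma in (0.0, 0.5, 1.0):
        for coverage in (5, 20):
            frac = simulate_dropout(500, sigma, coverage)
            print(f"sigma={sigma:.1f} coverage={coverage:2d} missing={frac:.3f}")
```

Running the script for several values of `sigma` and `coverage` shows the qualitative effect: a wider copy distribution needs more reads per sequence to reach the same fraction of recovered sequences.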