Naught all zeros in sequence count data are the same
Open Access
- 26 November 2018
- preprint content
- Published by Cold Spring Harbor Laboratory
- p. 477794
- https://doi.org/10.1101/477794
Abstract
Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts in statistical analyses, and multiple modeling approaches have been developed in response. Here, we apply common zero-handling models to gene-expression and microbiome datasets and show that the models disagree, on average, by 46% when identifying the most differentially expressed sequences. Next, to rationally examine how different zero-handling models behave, we developed a conceptual framework outlining four types of processes that may give rise to zero values in sequence count data. Last, we performed simulations to test how zero-handling models behave in the presence of these different zero-generating processes. Our simulations showed that simple count models are sufficient across multiple processes, even when the true underlying process is unknown. On the other hand, a common zero-handling technique known as “zero-inflation” was only suitable under a zero-generating process associated with an unlikely set of biological and experimental conditions. In concert, our work here suggests several specific guidelines for developing and choosing state-of-the-art models for analyzing sparse sequence count data.
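As a rough illustration of the distinction the abstract draws between simple count models and zero-inflation, the sketch below compares the share of zeros produced by a plain Poisson model with that of a zero-inflated Poisson model. It is not the authors' simulation code: the Poisson choice, the parameter values (`rate`, `pi_zero`), and all function names are assumptions made for this example only, and the four zero-generating processes of the paper's framework are not reproduced here.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_zero_inflated_poisson(rate, pi_zero, rng):
    """With probability pi_zero emit a structural zero, else a Poisson draw."""
    if rng.random() < pi_zero:
        return 0
    return sample_poisson(rate, rng)


def zero_fraction(values):
    """Fraction of entries that are exactly zero."""
    return sum(v == 0 for v in values) / len(values)


# Hypothetical settings chosen only for illustration.
rng = random.Random(0)
n_draws = 20000
rate = 0.5      # low mean count, e.g. a rare sequence at shallow depth
pi_zero = 0.3   # extra probability of a structural zero

poisson_counts = [sample_poisson(rate, rng) for _ in range(n_draws)]
zip_counts = [sample_zero_inflated_poisson(rate, pi_zero, rng)
              for _ in range(n_draws)]

# Theoretical zero probabilities under each model.
expected_poisson_zero = math.exp(-rate)
expected_zip_zero = pi_zero + (1 - pi_zero) * math.exp(-rate)

print("Poisson zero fraction:", zero_fraction(poisson_counts))
print("Zero-inflated zero fraction:", zero_fraction(zip_counts))
```

Even without any zero-inflation component, a low-rate count model yields a majority of zeros (the probability of a zero is exp(-rate)), which is one way to see why many zeros alone do not imply that a zero-inflated model is required.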