Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible

Open Access

3 April 2014

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 10 (4), e1003531
https://doi.org/10.1371/journal.pcbi.1003531

Abstract

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which do not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.

The term microbiome refers to the ecosystem of microbes that live in a defined environment. The decreasing cost and increasing speed of DNA sequencing technology have recently provided scientists with affordable and timely access to the genes and genomes of microbiomes that inhabit our planet and even our own bodies. In these investigations many microbiome samples are sequenced at the same time on the same DNA sequencing machine, but often result in total numbers of sequences per sample that are vastly different. The common procedure for addressing this difference in sequencing effort across samples – different library sizes – is to either (1) base analyses on the proportional abundance of each species in a library, or (2) rarefy, that is, throw away sequences from the larger libraries so that all have the same, smallest size. We show that both of these normalization methods can work when comparing obviously-different whole microbiomes, but that neither method works well when comparing the relative proportions of each bacterial species across microbiome samples. We show that alternative methods based on a statistical mixture model perform much better and can be easily adapted from a separate biological sub-discipline, called RNA-Seq analysis.
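The two common normalization procedures described above can be sketched in a few lines. This is a minimal illustration only, not the authors' code or the phyloseq implementation: the function names `to_proportions` and `rarefy`, the toy count table, and the fixed random seed are all assumptions made for the example.

```python
import random


def to_proportions(counts):
    """Procedure (1): divide each count by the sample's library size."""
    total = sum(counts)
    return [c / total for c in counts]


def rarefy(counts, depth, rng):
    """Procedure (2): subsample `depth` reads without replacement."""
    # Expand the counts into one taxon label per read, then draw at random.
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    kept = rng.sample(reads, depth)
    rarefied = [0] * len(counts)
    for taxon in kept:
        rarefied[taxon] += 1
    return rarefied


# Hypothetical toy data: three samples with identical composition
# but library sizes of 100, 1,000 and 10,000 reads.
samples = {
    "A": [50, 30, 15, 5],
    "B": [500, 300, 150, 50],
    "C": [5000, 3000, 1500, 500],
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
min_depth = min(sum(c) for c in samples.values())

proportions = {name: to_proportions(c) for name, c in samples.items()}
rarefied = {name: rarefy(c, min_depth, rng) for name, c in samples.items()}

# Reads thrown away by rarefying every library to the smallest size.
discarded = sum(sum(c) for c in samples.values()) - min_depth * len(samples)
```

In this toy table, rarefying to the smallest library (100 reads) discards 10,800 of the 11,100 reads. Proportions keep all reads but treat a fraction estimated from 100 reads the same as one estimated from 10,000, which is the heteroscedasticity the abstract refers to.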

Other Versions

Version 2, 2013-10-01, preprints

Cited by 2226 articles