To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics

Open Access

27 April 2020

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 48 (10), 5217-5234
https://doi.org/10.1093/nar/gkaa265

Abstract

As computational biologists continue to be inundated by ever-increasing amounts of metagenomic data, developing data analysis approaches that keep pace with the growth of sequence archives has remained a challenge. In recent years, the accelerating availability of genomic data has been accompanied by the application of a wide array of highly efficient approaches from other fields to metagenomics. For instance, sketching algorithms such as MinHash have seen rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances, augmenting previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.

Funding Information

Office of the Director of National Intelligence
Intelligence Advanced Research Projects Activity
Army Research Office (W911NF-17-2-0089)
Rice University
National Institute of Neurological Disorders and Stroke
National Institutes of Health (R21NS106640)
NSF (CCF-1911094, IIS-1838177, IIS-1730574)
ONR (N00014-18-12571, N00014-17-1-2551)
AFOSR (FA9550-18-1-0478)
DARPA (G001534-7500)
NLM (T15LM007093)
Vannevar Bush Faculty Fellowship (N00014-18-1-2047)
Amazon Research Award

