Mash: fast genome and metagenome distance estimation using MinHash
Open Access
- 20 June 2016
- journal article
- software
- Published by Springer Science and Business Media LLC in Genome Biology
- Vol. 17 (1), 1-14
- https://doi.org/10.1186/s13059-016-0997-x
Abstract
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash).
Funding Information
- National Human Genome Research Institute (Intramural Research Program)
- Science and Technology Directorate (HSHQDC-07-C-00020)
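The abstract describes reducing sequences to small sketches and estimating a mutation distance from them. The following is a minimal Python sketch of that idea, not the Mash implementation: it assumes a bottom-s MinHash sketch, uses BLAKE2b from the standard library as a stand-in hash (Mash itself uses a different hash function), skips canonical (strand-independent) k-mers, and applies the distance formula D = -(1/k) ln(2j / (1 + j)) from the body of the paper, which this record does not reproduce. The function names and the defaults k=21 and s=1000 are illustrative choices.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Return the set of 64-bit hashes of every k-mer in seq."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        # Assumption: BLAKE2b stands in for the hash used by the real tool.
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(seq, k=21, s=1000):
    """Bottom-s sketch: the s smallest distinct k-mer hashes, sorted."""
    return sorted(kmer_hashes(seq, k))[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches.

    Takes the s smallest hashes of the merged sketches and counts how
    many of them occur in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a mutation distance estimate."""
    if j <= 0:
        return 1.0  # no shared k-mers: report the maximum distance
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

Because each sketch holds at most s hashes regardless of genome size, comparing two sketches costs the same whether the inputs are plasmids or whole genomes, which is what makes all-pairs clustering of large collections tractable. A typical call would be `mash_distance(jaccard_estimate(sketch(a), sketch(b), 1000), 21)` for two sequence strings `a` and `b`.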