RepeatModeler2 for automated genomic discovery of transposable element families
- 16 April 2020
- journal article
- research article
- Published by Proceedings of the National Academy of Sciences in Proceedings of the National Academy of Sciences of the United States of America
- Vol. 117 (17), 9451-9457
- https://doi.org/10.1073/pnas.1921046117
Abstract
The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).
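The benchmark criterion stated in the abstract (a consensus sequence counts as matching a curated sequence when both sequence identity and sequence coverage exceed 95%) can be illustrated with a minimal sketch. The `Alignment` record, its field names, the coverage formula, and the family names are all assumptions made for illustration; this is not the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one consensus-vs-curated alignment."""
    family: str          # name of the curated family (illustrative)
    identity: float      # fraction of identical aligned positions, 0..1
    aligned_length: int  # curated positions covered by the alignment
    curated_length: int  # full length of the curated sequence


def coverage(aln: Alignment) -> float:
    # Assumed definition: covered fraction of the curated sequence.
    return aln.aligned_length / aln.curated_length


def is_match(aln: Alignment, threshold: float = 0.95) -> bool:
    # Both identity and coverage must exceed the threshold.
    return aln.identity > threshold and coverage(aln) > threshold


def matched_families(alignments, threshold: float = 0.95) -> set:
    # Set of curated families recovered by at least one consensus.
    return {a.family for a in alignments if is_match(a, threshold)}


# Made-up example data, not results from the paper.
alignments = [
    Alignment("Gypsy1", 0.98, 4900, 5000),  # passes both criteria
    Alignment("Copia2", 0.97, 3000, 5000),  # coverage too low
    Alignment("hAT3", 0.90, 2000, 2000),    # identity too low
]
result = matched_families(alignments)
print(sorted(result))
```

Counting the matched families for two libraries built from the same genome and comparing the totals would give the kind of ratio the abstract reports, under the assumptions above.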
Funding Information
- HHS | NIH | National Human Genome Research Institute (U01-HG009391)
- HHS | NIH | National Institute of General Medical Sciences (R35-GM122550)
- HHS | NIH | National Human Genome Research Institute (U24 HG010136)
- HHS | NIH | National Human Genome Research Institute (R01 HG002939)
- Canadian Network for Research and Innovation in Machining Technology, Natural Sciences and Engineering Research Council of Canada (NSERC PGSD graduate fellowship)
- NIGMS (R01 GM119125)