Combined Evidence Annotation of Transposable Elements in Genome Sequences

Open Access

1 January 2005

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 1 (2), e22-75
https://doi.org/10.1371/journal.pcbi.0010022

Abstract

Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced from combining multiple independent sources of computational evidence. To elevate the quality of TE annotations to a level comparable to that of gene models, we have developed a combined evidence-model TE annotation pipeline, analogous to systems used for gene annotation, by integrating results from multiple homology-based and de novo TE identification methods. As proof of principle, we have annotated “TE models” in Drosophila melanogaster Release 4 genomic sequences using the combined computational evidence derived from RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, TE-HMM and the previous Release 3.1 annotation. Our system is designed for use with the Apollo genome annotation tool, allowing automatic results to be curated manually to produce reliable annotations. The euchromatic TE fraction of D. melanogaster is now estimated at 5.3% (cf. 3.86% in Release 3.1), and we found a substantially higher number of TEs (n = 6,013) than previously identified (n = 1,572). Most of the new TEs derive from small fragments of a few hundred nucleotides long and highly abundant families not previously annotated (e.g., INE-1). We also estimated that 518 TE copies (8.6%) are inserted into at least one other TE, forming a nest of elements. The pipeline allows rapid and thorough annotation of even the most complex TE models, including highly deleted and/or nested elements such as those often found in heterochromatic sequences. Our pipeline can be easily adapted to other genome sequences, such as those of the D. melanogaster heterochromatin or other species in the genus Drosophila. A first step in adding value to the large-scale DNA sequences generated by genome projects is the process of annotation—marking biological features on the raw string of adenines, cytosines, guanines, and thymines. The predominant goal in genome annotation thus far has been to identify gene sequences that encode proteins; however, many functional sequences exist in non-protein-coding regions and their annotation remains incomplete. Mobile, repetitive DNA segments known as transposable elements (TEs) are one class of functional sequence in non-protein-coding regions, which can make up large fractions of genome sequences (e.g., about 45% in the human) and can play important roles in gene and chromosome structure and regulation. As a consequence, there has been increasing interest in the computational identification of TEs in genome sequences. Borrowing current ideas from the field of gene annotation, the authors have developed a pipeline to predict TEs in genome sequences that combines multiple sources of evidence from different computational methods. The authors' combined-evidence pipeline represents an important step towards raising the standards of TE annotation to the same quality as that of genes, and should help catalyze their understanding of the biological role of these fascinating sequences.

Keywords

This publication has 36 references indexed in Scilit:

De novo identification of repeat families in large genomes
Bioinformatics, 2005
Computational Gene Prediction Using Multiple Sources of Evidence
Genome Research, 2004
Initial sequencing and analysis of the human genome
Nature, 2001
A Greedy Algorithm for Aligning DNA Sequences
Journal of Computational Biology, 2000
Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology
Journal of the American Statistical Association, 1999
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research, 1997
A local alignment tool for very long DNA sequences
Bioinformatics, 1995
Identification of protein coding regions by database similarity search
Nature Genetics, 1993
Basic local alignment search tool
Journal of Molecular Biology, 1990
Eucaryotic transposable genetic elements with inverted terminal repeats
Cell, 1980

Cited by 315 articles