Repeat or not repeat?—Statistical validation of tandem repeat prediction in genomic sequences

Open Access

24 August 2012

journal article
research article
Published by Oxford University Press (OUP) in Nucleic Acids Research

Vol. 40 (20), 10005-10017
https://doi.org/10.1093/nar/gks726

Abstract

Tandem repeats (TRs) represent one of the most prevalent features of genomic sequences. Due to their abundance and functional significance, a plethora of detection tools has been devised over the last two decades. Despite this longstanding interest, the problem of TR detection remains unresolved. Our large-scale tests reveal that current detectors produce different, often nonoverlapping inferences, reflecting characteristics of the underlying algorithms rather than the true distribution of TRs in genomic data. Our simulations show that the power to detect TRs depends on the degree of their divergence and on repeat characteristics such as the length of the minimal repeat unit and the number of units in tandem. To reconcile the diverse predictions of current algorithms, we propose and evaluate several statistical criteria for measuring the quality of predicted repeat units. In particular, we propose a model-based phylogenetic classifier, entailing a maximum-likelihood estimation of the repeat divergence. Applied in conjunction with state-of-the-art detectors, our statistical classification scheme for inferred repeats makes it possible to filter out false-positive predictions. Because different algorithms appear to specialize in predicting TRs with certain properties, we advise applying multiple detectors with subsequent filtering to obtain the most complete set of genuine repeats.
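
The workflow recommended in the abstract (pool the predictions of several detectors, then filter them by an estimate of repeat divergence) can be illustrated with a minimal sketch. This is not the classifier proposed in the article: as a stand-in it uses the mean pairwise Jukes-Cantor (JC69) distance between repeat units, which is the maximum-likelihood distance estimate under that simple substitution model, and it ignores indels. The function names, the detector labels and the `max_divergence` threshold are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def split_units(region, unit_len):
    """Cut a predicted repeat region into consecutive full-length units."""
    n_units = len(region) // unit_len
    return [region[i * unit_len:(i + 1) * unit_len] for i in range(n_units)]


def p_distance(a, b):
    """Proportion of mismatching sites between two equal-length units."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc69_distance(p):
    """JC69 distance (substitutions per site) for mismatch proportion p.

    The estimate is undefined once p reaches 0.75 (saturation), so
    infinity is returned in that case.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


def mean_divergence(region, unit_len):
    """Mean pairwise JC69 distance between the units of one prediction."""
    units = split_units(region, unit_len)
    if len(units) < 2:
        return math.inf  # a single unit is not a tandem repeat
    dists = [jc69_distance(p_distance(a, b)) for a, b in combinations(units, 2)]
    return sum(dists) / len(dists)


def filter_predictions(predictions, max_divergence=0.5):
    """Pool predictions from several detectors and keep the conserved ones.

    predictions: iterable of (detector_name, region, unit_len) tuples.
    max_divergence is an arbitrary illustrative cutoff, not a value
    taken from the article.
    Returns a dict mapping (region, unit_len) to the set of detectors
    that reported it.
    """
    kept = {}
    for detector, region, unit_len in predictions:
        if mean_divergence(region, unit_len) <= max_divergence:
            kept.setdefault((region, unit_len), set()).add(detector)
    return kept


# Hypothetical predictions from two hypothetical detectors.
predictions = [
    ("detA", "ACGTACGTACGT", 4),  # perfect repeat
    ("detB", "ACGTACGTACGT", 4),  # same region found by a second tool
    ("detB", "ACGTTGCAGATC", 4),  # units share almost nothing
]
kept = filter_predictions(predictions)
```

In this toy input the perfect repeat is retained and credited to both detectors, while the third prediction is discarded because its units are saturated with mismatches. A model-based test of the kind the abstract describes would replace the fixed cutoff with a statistical criterion.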

