On the statistical significance of nucleic acid similarities

Abstract
When evaluating sequence similarities among nucleic acids by the usual methods, statistical significance is often found when the biological significance of the similarity is dubious. We demonstrate that the known statistical properties of nucleic acid sequences strongly affect the statistical distribution of similarity values when calculated by standard procedures. We propose a series of models which account for some of these known statistical properties. The utility of the method is demonstrated in evaluating high relative similarity scores in four specific cases in which there is little biological context by which to judge the similarities. In two of the cases we identify the statistical properties which are responsible for the apparent similarity. In the other two cases the statistical significance of the similarity persists even when the known statistical properties of sequences are modelled. For one of these cases biological significance is likely while the other case remains an enigma.
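The general idea of judging a similarity score against a null model that preserves a known statistical property of the sequences can be illustrated with a minimal sketch. The abstract does not specify the models or the similarity measure, so everything below is an assumption made for illustration: the null model preserves only base composition (by shuffling each sequence), the score is the longest exact ungapped match, and the function names `best_ungapped_match`, `shuffled`, and `empirical_p_value` are hypothetical. Richer models, for example ones that also preserve dinucleotide frequencies, would replace the shuffling step.

```python
import random


def best_ungapped_match(a: str, b: str) -> int:
    """Length of the longest run of identical bases over all ungapped alignments.

    Hypothetical stand-in for a similarity score; not the paper's measure.
    """
    best = 0
    # Each offset selects one diagonal of the comparison matrix (offset = i - j).
    for offset in range(-(len(b) - 1), len(a)):
        run = 0
        for i in range(len(a)):
            j = i - offset
            if 0 <= j < len(b) and a[i] == b[j]:
                run += 1
                best = max(best, run)
            else:
                run = 0
    return best


def shuffled(seq: str, rng: random.Random) -> str:
    """Random permutation of seq: base composition is kept, base order is not."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def empirical_p_value(a: str, b: str, trials: int = 200, seed: int = 0):
    """Estimate how often composition-matched random sequences score as high.

    Returns (observed score, empirical p-value). The +1 terms keep the
    estimate strictly above zero for a finite number of trials.
    """
    rng = random.Random(seed)
    observed = best_ungapped_match(a, b)
    hits = sum(
        best_ungapped_match(shuffled(a, rng), shuffled(b, rng)) >= observed
        for _ in range(trials)
    )
    return observed, (hits + 1) / (trials + 1)


if __name__ == "__main__":
    score, p = empirical_p_value("ACGTACGTTTGACCA", "TTACGTTTGAGGCA")
    print(score, p)
```

Under this sketch, a similarity that remains improbable after the null model accounts for composition is a candidate for further scrutiny, whereas one that becomes unremarkable is explained by composition alone.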
