Signal, Noise, and Reliability in Molecular Phylogenetic Analyses

Abstract
DNA sequences and other molecular data compared among organisms may contain phylogenetic signal, or they may be randomized with respect to phylogenetic history. Some method is needed to distinguish phylogenetic signal from random noise to avoid analysis of data that have been randomized with respect to the historical relationships of the taxa being compared. We analyzed 8,000 random data matrices consisting of 10–500 binary or four-state characters and 5–25 taxa to study several options for detecting signal in systematic data bases. Analysis of random data often yields a single most-parsimonious tree, especially if the number of characters examined is large and the number of taxa examined is small (both often true in molecular studies). The most-parsimonious tree inferred from random data may also be considerably shorter than the second-best alternative. The distribution of tree lengths of all tree topologies (or a random sample thereof) provides a sensitive measure of phylogenetic signal: data matrices with phylogenetic signal produce tree-length distributions that are strongly skewed to the left, whereas those composed of random noise are closer to symmetrical. In simulations of phylogeny with varying rates of mutation (up to levels that produce random variation among taxa), the skewness of tree-length distributions is closely related to the success of parsimony in finding the true phylogeny. Tables of critical values of a skewness test statistic, g., are provided for binary and four-state characters for 10–500 characters and 5–25 taxa. These tables can be used in a rapid and efficient test for significant structure in data matrices for phylogenetic analysis.