Maximum Likelihood Trees from DNA Sequences: A Peculiar Statistical Estimation Problem

Abstract
The parameter space of the phylogenetic tree estimation problem consists of three components, T, t, and θ. The tree topology T is a discrete entity that is not a proper statistical parameter but that can nevertheless be estimated using the maximum likelihood criterion. Its role is to specify the branch length parameters and the form of the likelihood function(s). Branch lengths t are conditional on T and are meaningful only for specific values of T. Parameters θ in the model of nucleotide substitution are common to all the tree topologies and represent quantities such as the transition/transversion rate ratio. T and t thus represent the tree, and θ represents the model. With typical DNA sequence data, differences in T have only a small effect on the likelihood, whereas changes in θ influence the likelihood greatly. Estimates of θ are also found to be insensitive to T, making it possible to obtain reliable estimates of θ and to perform tests concerning the model (θ) even if knowledge of the evolutionary relationship (T) is not available. In contrast, tests concerning t, such as testing the existence of a molecular clock, appear to be more difficult to perform when the true topology is unknown. In this paper, we explore the peculiarity of the parameter space of the tree estimation problem and suggest methods for overcoming some difficulties involved with tests concerning the model. We also address difficulties concerning hypothesis testing on T, i.e., evaluation of the reliability of the estimated tree topology. We note that estimation of T, and particularly tests concerning T, depend critically on the assumed model.
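
The nested structure of the parameter space described above can be written out as a short sketch. The notation here is an assumption introduced for illustration and is not taken from the paper: X denotes the aligned sequence data, x_h the site pattern at site h of n sites, t_T the branch-length vector defined by topology T, and f the probability of a site pattern under the substitution model. Sites are assumed to evolve independently, which is a common simplification and not something the abstract states.

```latex
% Log-likelihood for a fixed topology T (sites assumed independent)
\ell_T(t_T, \theta) \;=\; \log L(T, t_T, \theta;\, X)
  \;=\; \sum_{h=1}^{n} \log f(x_h \mid T, t_T, \theta)

% Branch lengths and model parameters are optimized within each topology;
% the topology is then chosen by comparing the maximized values
\hat{T} \;=\; \arg\max_{T} \; \max_{t_T,\, \theta} \; \ell_T(t_T, \theta)
```

Written this way, the inner maximization is an ordinary continuous optimization over t_T and θ, while the outer step compares separately maximized likelihoods across discrete topologies, each with its own branch-length parameters. Only θ has the same meaning in every ℓ_T.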