Sparse Supermatrices for Phylogenetic Inference: Taxonomy, Alignment, Rogue Taxa, and the Phylogeny of Living Turtles

Abstract
As phylogenetic data sets grow in size and number, objective methods to summarize this information are becoming increasingly important. Supermatrices can combine existing data directly and in principle provide effective syntheses of phylogenetic information that may reveal new relationships. However, several serious difficulties exist in the construction of large supermatrices that must be overcome before these approaches will enjoy broad utility. We present analyses that examine the performance of sparse supermatrices constructed from large sequence databases for the reconstruction of species-level phylogenies. We develop a largely automated informatics pipeline that allows for the construction of sparse supermatrices from GenBank data. In doing so, we develop strategies for alleviating some of the outstanding impediments to accurate phylogenetic inference using these approaches. These include taxonomic standardization, automated alignment, and the identification of rogue taxa. We use turtles as an exemplar clade and present a well-supported species-level phylogeny for two-thirds of all turtle species based on a ∼50 kb supermatrix consisting of 93% missing data. Finally, we discuss some of the remaining pitfalls and concerns associated with supermatrix analyses, provide comparisons to supertree approaches, and suggest areas for future research.