Multiple alignment by aligning alignments
Open Access
- 1 July 2007
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 23 (13), i559-i568
- https://doi.org/10.1093/bioinformatics/btm226
Abstract
Motivation: Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment.
Results: We study six stages in the form-and-polish strategy for multiple alignment: parameter choice, distance estimation, merge-tree construction, sequence-pair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3-cuts. Experiments with a parameter-value oracle suggest large gains in quality may be possible through an input-dependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that on benchmark alignments matches the quality of the top tools, without employing alignment consistency or hydrophobic gap penalties.
Availability: Opal, a multiple alignment tool that implements the best methods in our study, is freely available at http://opal.cs.arizona.edu
Contact: twheeler@cs.arizona.edu
This publication has 32 references indexed in Scilit.
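The distance-estimation stage named in the abstract (distances from normalized alignment costs) can be illustrated with a minimal sketch. The abstract does not state how the cost is normalized or which scoring scheme is used, so everything below is an assumption made for illustration: unit mismatch and linear gap costs, normalization by the mean of the two sequence lengths, and the function names `alignment_cost`, `normalized_distance`, and `distance_matrix`. It should not be read as Opal's actual method.

```python
from itertools import combinations


def alignment_cost(a, b, mismatch=1.0, gap=1.0):
    """Minimum global alignment cost of a and b (dynamic programming).

    Assumed scoring: 0 for a match, `mismatch` for a substitution,
    `gap` per gap character (linear gaps). These are placeholder values.
    """
    # prev[j] = cost of aligning the empty prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def normalized_distance(a, b, **costs):
    """Alignment cost divided by mean sequence length (assumed normalization)."""
    mean_len = (len(a) + len(b)) / 2
    return alignment_cost(a, b, **costs) / mean_len if mean_len else 0.0


def distance_matrix(seqs):
    """Symmetric matrix of pairwise normalized distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = normalized_distance(seqs[i], seqs[j])
    return d
```

A matrix like this is the kind of input a merge-tree construction stage would consume. Dividing by length keeps long sequence pairs from appearing more distant merely because they accumulate more total cost.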