A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

Open Access

30 May 2008

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 4 (5), e1000069
https://doi.org/10.1371/journal.pcbi.1000069

Abstract

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.

Sequence database searches are a fundamental tool of molecular biology, enabling researchers to identify related sequences in other organisms, which often provides invaluable clues to the function and evolutionary history of genes. The power of database searches to detect more and more remote evolutionary relationships – essentially, to look back deeper in time – has improved steadily, with the adoption of more complex and realistic models. However, database searches require not just a realistic scoring model, but also the ability to distinguish good scores from bad ones – the ability to calculate the statistical significance of scores. For many models and scoring schemes, accurate statistical significance calculations have either involved expensive computational simulations, or not been feasible at all. Here, I introduce a probabilistic model of local sequence alignment that has readily predictable score statistics for position-specific profile scoring systems, and not just for traditional optimal alignment scores, but also for more powerful log-likelihood ratio scores derived in a full probabilistic inference framework. These results remove one of the main obstacles that have impeded the use of more powerful and biologically realistic statistical inference methods in sequence homology searches.
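
The two score distributions conjectured in the abstract can be illustrated with a minimal Python sketch. It assumes bit scores and the constant λ = log 2 stated above. The location parameters `mu` (Gumbel) and `tau` (start of the exponential tail) are hypothetical names for values that would still have to be calibrated for each model; the abstract does not give them. The function names and the multiplication of a P-value by the number of target sequences to obtain an E-value are likewise illustrative conventions, not the author's implementation.

```python
import math

# Conjectured constant for bit scores (from the abstract): lambda = log 2.
LAMBDA = math.log(2)


def gumbel_pvalue(score, mu, lam=LAMBDA):
    """P(S >= score) for a Gumbel distribution with location mu and slope lam.

    Intended for Viterbi (optimal alignment) bit scores.
    `mu` is an assumed, separately calibrated location parameter.
    """
    # 1 - exp(-exp(-lam * (score - mu))), computed stably via expm1.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def exponential_tail_pvalue(score, tau, lam=LAMBDA):
    """P(S >= score) under an exponential tail starting at tau.

    Intended for the high-scoring tail of Forward bit scores only.
    `tau` is an assumed, separately calibrated tail location; the value
    is capped at 1 for scores below tau, where the tail form does not apply.
    """
    return min(1.0, math.exp(-lam * (score - tau)))


def evalue(pvalue, n_targets):
    """Expected number of hits at least this good in n_targets comparisons."""
    return pvalue * n_targets
```

With λ = log 2, each additional bit of score halves the tail probability, so under these assumptions a score 10 bits above `tau` has a tail P-value of about 2^-10. For scores well above `mu`, the Gumbel P-value approaches the same exponential form.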

