Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships

26 May 1998

journal article
research article
Published by Proceedings of the National Academy of Sciences in Proceedings of the National Academy of Sciences of the United States of America

Vol. 95 (11), 6073-6078
https://doi.org/10.1073/pnas.95.11.6073

Abstract

Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536–540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403–410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460–480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444–2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195–197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA with ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20–30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.
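The abstract's statement that false-positive counts agree with reported E-values can be illustrated with a small sketch. An E-value is the expected number of chance matches per query at a given score or better, so if the scores are well calibrated, the observed false positives per query at threshold E should be close to E. The function name, the hit format, and the sample data below are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Each hit is (E-value, is_true_homolog). The truth label is assumed to
# come from an external structural classification; this format is hypothetical.
Hit = Tuple[float, bool]


def coverage_and_errors_per_query(
    hits: List[Hit], threshold: float, n_queries: int, n_true_pairs: int
) -> Tuple[float, float]:
    """Return (coverage, false positives per query) at an E-value threshold."""
    accepted = [hit for hit in hits if hit[0] <= threshold]
    true_pos = sum(1 for _, is_homolog in accepted if is_homolog)
    false_pos = len(accepted) - true_pos
    coverage = true_pos / n_true_pairs   # fraction of known relationships found
    errors_per_query = false_pos / n_queries
    return coverage, errors_per_query


# Made-up hits pooled over 4 queries, with 5 known true relationships.
example_hits: List[Hit] = [
    (1e-10, True), (1e-5, True), (0.01, True),
    (0.5, False), (0.8, True), (2.0, False), (5.0, False),
]
coverage, errors_per_query = coverage_and_errors_per_query(
    example_hits, threshold=1.0, n_queries=4, n_true_pairs=5
)
print(coverage, errors_per_query)
```

Comparing `errors_per_query` with `threshold` over a range of thresholds gives a rough calibration check: values far above the threshold would indicate scores that overstate significance.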

This publication has 38 references indexed in Scilit:

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research, 1997
CATH – a hierarchic classification of protein domain structures
Structure, 1997
An Assessment of Amino Acid Exchange Matrices in Aligning Protein Sequences: The Twilight Zone Revisited
Journal of Molecular Biology, 1995
SCOP: A structural classification of proteins database for the investigation of sequences and structures
Journal of Molecular Biology, 1995
A Structural Basis for Sequence Comparisons
Journal of Molecular Biology, 1993
Basic local alignment search tool
Journal of Molecular Biology, 1990
Evaluation and improvements in the automatic alignment of protein sequences
Protein Engineering, Design and Selection, 1987
Molecular packing and intermolecular contacts of sickling deer type III hemoglobin
Journal of Molecular Biology, 1979
An improved method of testing for evolutionary homology
Journal of Molecular Biology, 1966
Structure and function of haemoglobin: II. Some relations between polypeptide chain configuration and amino acid sequence
Journal of Molecular Biology, 1965

Cited by 378 articles