FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately
- 3 February 2010
- journal article
- research article
- Published in Proceedings of the National Academy of Sciences
- Vol. 107 (8), 3481-3486
- https://doi.org/10.1073/pnas.0914097107
Abstract
Fast identification of protein structures that are similar to a specified query structure in the entire Protein Data Bank (PDB) is fundamental in structure and function prediction. We present FragBag, an ultrafast and accurate method for comparing protein structures. We describe a protein structure by the collection of its overlapping short contiguous backbone segments, and discretize this set using a library of fragments. Then, we succinctly represent the protein as a "bag-of-fragments" (a vector that counts the number of occurrences of each fragment) and measure the similarity between two structures by the similarity between their vectors. Our representation has two additional benefits: (i) it can be used to construct an inverted index, for implementing a fast structural search engine of the entire PDB, and (ii) one can specify a structure as a collection of substructures, without combining them into a single structure; this is valuable for structure prediction, when there are reliable predictions only of parts of the protein. We use receiver operating characteristic curve analysis to quantify the success of FragBag in identifying neighbor candidate sets in a dataset of over 2,900 structures. The gold standard is the set of neighbors found by six state-of-the-art structural aligners. Our best FragBag library finds more accurate candidate sets than the three other filter methods: the SGM, PRIDE, and a method by Zotenko et al. More interestingly, FragBag performs on a par with the computationally expensive, yet highly trusted structural aligners STRUCTAL and CE.
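The representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how a backbone segment is assigned to a library fragment, so the sketch substitutes a comparison of internal point-to-point distances, which needs no superposition step. Cosine similarity is likewise only one possible way to compare the count vectors. The two-fragment library, the toy coordinates, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def internal_distances(segment):
    """Distances between every pair of points in a segment.

    These are unchanged by rotating or translating the segment.
    """
    return [math.dist(a, b) for a, b in combinations(segment, 2)]


def fragment_distance(seg_a, seg_b):
    """Root-mean-square difference of internal distances (an assumed stand-in metric)."""
    da, db = internal_distances(seg_a), internal_distances(seg_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


def bag_of_fragments(backbone, library):
    """Count how many overlapping backbone segments best match each library fragment."""
    k = len(library[0])  # all library fragments are assumed to share one length
    counts = Counter()
    for i in range(len(backbone) - k + 1):
        segment = backbone[i:i + k]
        best = min(range(len(library)),
                   key=lambda j: fragment_distance(segment, library[j]))
        counts[best] += 1
    return [counts[j] for j in range(len(library))]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_inverted_index(bags):
    """Map each fragment id to the names of the structures whose bag contains it."""
    index = {}
    for name, bag in bags.items():
        for frag_id, count in enumerate(bag):
            if count:
                index.setdefault(frag_id, set()).add(name)
    return index


# Hypothetical library: fragment 0 is straight, fragment 1 has a right-angle bend.
library = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
    [(0, 0, 0), (1, 0, 0), (1, 1, 0)],
]

# Toy "backbones" given as lists of 3D points (not real protein coordinates).
structures = {
    "straight": [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)],
    "stair": [(0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0), (2, 2, 0)],
    "mixed": [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 2, 0)],
}

bags = {name: bag_of_fragments(coords, library) for name, coords in structures.items()}
index = build_inverted_index(bags)
similarity = cosine_similarity(bags["straight"], bags["mixed"])
```

In this sketch the inverted index narrows a search to structures that share at least one fragment with the query, and only those candidates need a vector comparison. A real library would hold many more fragments, and the segments would be taken from backbone atom coordinates.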