Support Vector Machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs

Open Access

7 November 2007

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 23 (24), 3320-3327
https://doi.org/10.1093/bioinformatics/btm527

Abstract

Motivation: Fold recognition is a key step in the protein structure discovery process, especially when traditional sequence comparison methods fail to yield convincing structural homologies. Although many methods have been developed for protein fold recognition, their accuracies remain low. This can be attributed to insufficient exploitation of fold discriminatory features.

Results: We have developed a new method for protein fold recognition using structural information of amino acid residues and amino acid residue pairs. Since protein fold recognition can be treated as a protein fold classification problem, we have developed a Support Vector Machine (SVM) based classifier approach that uses secondary structural state and solvent accessibility state frequencies of amino acids and amino acid pairs as feature vectors. Among the individual properties examined, secondary structural state frequencies of amino acids gave an overall accuracy of 65.2% for fold discrimination, which is better than the accuracy of any method reported so far in the literature. Combination of secondary structural state frequencies with solvent accessibility state frequencies of amino acids and amino acid pairs further improved the fold discrimination accuracy to more than 70%, which is ∼8% higher than the best available method. In this study we have also tested, for the first time, an all-together multi-class method known as the Crammer and Singer method for protein fold classification. Our studies reveal that the three multi-class classification methods, namely one versus all, one versus one and the Crammer and Singer method, yield similar predictions.

Availability: Dataset and stand-alone program are available upon request.

Contact: han@cdfd.org.in

Supplementary information: Supplementary data are available at Bioinformatics online.
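As an illustration of the residue-level features described in the Results, the following is a minimal sketch of how secondary structural state frequencies of amino acids might be turned into a feature vector. The abstract does not specify the state alphabet, the normalization, or the vector layout, so the three-state alphabet (H, E, C), the division by chain length, and the function name `ss_state_frequencies` are all assumptions made here for illustration, not the authors' implementation.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed three-state secondary structure alphabet: helix, strand, coil.
SS_STATES = "HEC"


def ss_state_frequencies(sequence, ss_string):
    """Return a 60-dimensional vector of (amino acid, state) frequencies.

    Hypothetical helper: entry i * 3 + j holds the fraction of positions
    where amino acid AMINO_ACIDS[i] occurs in state SS_STATES[j].
    Positions with non-standard residues or states are counted in the
    chain length but do not contribute to any entry.
    """
    if len(sequence) != len(ss_string):
        raise ValueError("sequence and secondary structure must have equal length")
    n = len(sequence)
    if n == 0:
        return [0.0] * (len(AMINO_ACIDS) * len(SS_STATES))
    # Count how often each (residue, state) combination occurs.
    counts = Counter(zip(sequence, ss_string))
    # Assumed normalization: divide each count by the chain length.
    return [counts[(aa, s)] / n for aa in AMINO_ACIDS for s in SS_STATES]


# Toy input: four residues with made-up secondary structure states.
vec = ss_state_frequencies("ACDA", "HHEC")
```

Vectors of this kind, one per protein, could then be passed to any SVM implementation. Solvent accessibility states and residue pairs would presumably be encoded analogously, but their exact definitions are not given in the abstract.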
