Identification of protein functions using a machine-learning approach based on sequence-derived properties

Open Access

9 August 2009

journal article
Published by Springer Science and Business Media LLC in Proteome Science

Vol. 7 (1), 27
https://doi.org/10.1186/1477-5956-7-27

Abstract

Background: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.

Results: A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.

Conclusion: We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
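
To make the idea of sequence-derived features concrete, the following is a minimal sketch in Python of how global and local charge-based descriptors might be computed from a raw sequence. It is not the paper's PNPRD definition, which the abstract does not specify. The residue groupings (K and R as positive, D and E as negative), the window size, and the names `charge_fractions` and `sequence_features` are all assumptions made for illustration.

```python
# Illustrative only: a hypothetical charge-based feature extractor,
# not the PNPRD features defined in the paper.
POSITIVE = set("KR")  # assumption: lysine and arginine; histidine left out
NEGATIVE = set("DE")  # assumption: aspartate and glutamate
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def charge_fractions(region):
    """Return (positive fraction, negative fraction) for a sequence region."""
    n = len(region)
    if n == 0:
        return 0.0, 0.0
    pos = sum(1 for r in region if r in POSITIVE)
    neg = sum(1 for r in region if r in NEGATIVE)
    return pos / n, neg / n


def sequence_features(seq, window=10):
    """Build a feature dict from a protein sequence alone.

    Global features: amino acid composition and charge fractions over the
    full sequence. Local features: charge fractions over the first and last
    `window` residues (a hypothetical choice of local regions).
    """
    seq = seq.upper()
    n = len(seq)
    features = {}
    for aa in AMINO_ACIDS:
        features[f"comp_{aa}"] = seq.count(aa) / n if n else 0.0
    regions = {"full": seq, "nterm": seq[:window], "cterm": seq[-window:]}
    for name, region in regions.items():
        pos, neg = charge_fractions(region)
        features[f"{name}_pos"] = pos
        features[f"{name}_neg"] = neg
        features[f"{name}_diff"] = pos - neg  # positive minus negative fraction
    return features


if __name__ == "__main__":
    example = "KKDE" + "A" * 14 + "RR"  # toy sequence, not real data
    for key, value in sequence_features(example, window=4).items():
        if value:
            print(key, round(value, 3))
```

A feature dictionary of this kind could be turned into a fixed-length vector and passed to a classifier such as a random forest, with feature selection applied afterwards to find a compact subset, which is the general workflow the abstract describes.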

