Microenvironment-Based Protein Function Analysis by Random Forest
- 1 August 2014
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 3138-3143
- https://doi.org/10.1109/icpr.2014.541
Abstract
Machine learning-based prediction of protein functions plays a key role in bioinformatics and pharmaceutical research, facilitating swift discovery of new drugs in high-throughput settings. This paper presents an adaptation of Random Forest to structure-based protein function prediction. Our system represents a protein's 3D physicochemical structural information in microenvironment descriptors whose spatial resolution is much finer than that of sequence-based protein descriptors. We prepare our datasets for seven active sites from five protein function classes by using multiple public data banks and train Random Forest classifiers to identify these seven function models in proteins. This paper presents two experimental studies: 1) a 5-fold stratified cross-validation comparing Random Forest with Naive Bayes and Support Vector Machine and 2) a systematic comparison of Random Forest's two variable importance measures. Promising results from these studies demonstrate the potential of Random Forest to improve the accuracy of current protein function assays.
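The 5-fold stratified cross-validation named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `stratified_kfold` and `train_test_splits` are hypothetical, the toy label counts are made up, and only the Python standard library is used so that the splitting logic itself stays visible. The classifiers being compared would be trained on each training split and scored on the matching test split.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Return k lists of test indices, each keeping roughly the class ratio of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_kfold(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# Toy labels (assumed for illustration): 20 site examples and 80 non-site examples.
toy_labels = [1] * 20 + [0] * 80
splits = list(train_test_splits(toy_labels, k=5, seed=0))

for fold_number, (train, test) in enumerate(splits, start=1):
    positives = sum(toy_labels[i] for i in test)
    print(f"fold {fold_number}: {len(train)} train, {len(test)} test, {positives} positive in test")
```

Stratifying by class matters when site examples are far rarer than non-site examples, as in the toy labels here: a plain random split could leave a fold with very few positives, which would make per-fold accuracy estimates unstable.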