Prediction of Protein–Protein Interaction Sites in Sequences and 3D Structures by Random Forests
Open Access
- 30 January 2009
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLoS Computational Biology
- Vol. 5 (1), e1000278
- https://doi.org/10.1371/journal.pcbi.1000278
Abstract
Identifying interaction sites in proteins provides important clues to the function of a protein and is becoming increasingly relevant in topics such as systems biology and drug discovery. Although there are numerous papers on the prediction of interaction sites using information derived from structure, there are only a few case reports on the prediction of interaction residues based solely on protein sequence. Here, a sliding window approach is combined with the Random Forests method to predict protein interaction sites using (i) a combination of sequence- and structure-derived parameters and (ii) sequence information alone. For sequence-based prediction we achieved a precision of 84% with a 26% recall and an F-measure of 40%. When combined with structural information, the prediction performance changes to a precision of 76% and a recall of 38%, raising the F-measure to 51%. We also present an attempt to rationalize the sliding window size and demonstrate that a nine-residue window is the most suitable for predictor construction. Finally, we demonstrate the applicability of our prediction methods by modeling the Ras–Raf complex using predicted interaction sites as target binding interfaces. Our results suggest that it is possible to predict protein interaction sites with fairly high accuracy using only sequence information.
Author Summary
In their active state, proteins, the workhorses of a living cell, need to have a defined 3D structure. The majority of functions in the living cell are performed through protein interactions that occur through specific, often unknown, residues on their surfaces. We can study protein interactions either qualitatively (interaction: yes/no) using large-scale, high-throughput experiments, or determine specific interaction sites using biophysical techniques such as X-ray crystallography, which are much more laborious and still cannot provide a complete interaction map within the cell.
This paper presents the machine learning classification method termed “Random Forests” in its application to predicting interaction sites. We use interaction data from available experimental evidence to train the classifier and predict the interacting residues on proteins with unknown 3D structures. Using this approach, we are able to predict many more interactions in greater detail (i.e., to accurately predict most of the binding site) and with that to infer knowledge about the functions of unknown proteins.
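Two quantitative ingredients of the abstract, the nine-residue sliding window and the reported precision, recall, and F-measure figures, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the padding symbol, and the way chain ends are handled are assumptions made for illustration, and the example sequence is an arbitrary string. The F-measure is taken here as the usual harmonic mean of precision and recall, which is consistent with the values quoted in the abstract.

```python
from typing import List

# Hypothetical padding symbol for window positions that fall outside the chain.
PAD = "-"


def sliding_windows(sequence: str, size: int = 9) -> List[str]:
    """Return one window per residue, centred on that residue.

    Chain ends are padded so that every residue gets a full-size window;
    this end handling is an assumption, not taken from the paper.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = size // 2
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + size] for i in range(len(sequence))]


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Arbitrary example sequence; each printed window has its residue at index 4.
    for window in sliding_windows("MTEYKLVVVG"):
        print(window)

    # Precision and recall as reported in the abstract.
    print(f"sequence only:      F = {f_measure(0.84, 0.26):.2f}")
    print(f"sequence+structure: F = {f_measure(0.76, 0.38):.2f}")
```

With the reported precision and recall, `f_measure` gives roughly 0.40 for the sequence-only predictor and 0.51 for the combined predictor, matching the abstract. In a full pipeline each window would be encoded as a numeric feature vector and passed to a Random Forest classifier (for example `RandomForestClassifier` in scikit-learn, named here only as one possible choice); that step is omitted because the abstract does not specify the encoding.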