Prediction of Protein–Protein Interaction Sites in Sequences and 3D Structures by Random Forests

Open Access

30 January 2009

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 5 (1), e1000278
https://doi.org/10.1371/journal.pcbi.1000278

Abstract

Identifying interaction sites in proteins provides important clues to the function of a protein and is becoming increasingly relevant in topics such as systems biology and drug discovery. Although there are numerous papers on the prediction of interaction sites using information derived from structure, there are only a few case reports on the prediction of interaction residues based solely on protein sequence. Here, a sliding window approach is combined with the Random Forests method to predict protein interaction sites using (i) a combination of sequence- and structure-derived parameters and (ii) sequence information alone. For sequence-based prediction we achieved a precision of 84% with a 26% recall and an F-measure of 40%. When sequence is combined with structural information, recall rises to 38% and the F-measure to 51%, at a precision of 76%. We also present an attempt to rationalize the sliding window size and demonstrate that a nine-residue window is the most suitable for predictor construction. Finally, we demonstrate the applicability of our prediction methods by modeling the Ras–Raf complex using predicted interaction sites as target binding interfaces. Our results suggest that it is possible to predict protein interaction sites with fairly high accuracy using only sequence information.

In their active state, proteins, the workhorses of a living cell, need to have a defined 3D structure. The majority of functions in the living cell are performed through protein interactions that occur through specific, often unknown, residues on their surfaces. We can study protein interactions either qualitatively (interaction: yes/no) using large-scale, high-throughput experiments, or determine specific interaction sites using biophysical techniques such as X-ray crystallography, which are much more laborious and still cannot provide a complete interaction map within the cell. This paper presents the machine learning classification method termed “Random Forests” in its application to predicting interaction sites. We use interaction data from available experimental evidence to train the classifier and predict the interacting residues on proteins with unknown 3D structures. Using this approach, we are able to predict many more interactions in greater detail (i.e., to accurately predict most of the binding site) and thereby infer knowledge about the functions of unknown proteins.
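
The sliding window idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-hot encoding, the `-` padding symbol, and the function names `sliding_windows`, `one_hot`, and `f_measure` are assumptions made for illustration, and the abstract does not say which sequence-derived parameters were actually used. The sketch cuts a sequence into nine-residue windows centred on each residue, turns each window into a numeric vector that any Random Forest implementation could take as input, and recomputes the F-measure from the reported precision and recall.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"  # hypothetical padding symbol for positions beyond the chain ends


def sliding_windows(sequence: str, size: int = 9) -> List[str]:
    """Return one window per residue, centred on that residue."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so it has a central residue")
    half = size // 2
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + size] for i in range(len(sequence))]


def one_hot(window: str) -> List[int]:
    """Encode a window as a flat 0/1 vector, 20 entries per position.

    Padding positions are encoded as all zeros (an assumption of this sketch).
    """
    return [int(aa == residue) for residue in window for aa in AMINO_ACIDS]


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    toy_sequence = "MTEYKLVVVG"  # arbitrary toy sequence, for illustration only
    windows = sliding_windows(toy_sequence)
    features = [one_hot(w) for w in windows]
    print(windows[0], windows[4], len(features[0]))
    # Values reported in the abstract, expressed as fractions
    print(round(f_measure(0.84, 0.26), 2), round(f_measure(0.76, 0.38), 2))
```

Each residue gets its own feature vector describing its sequence neighbourhood, so a classifier trained on such vectors labels residues one at a time as interacting or non-interacting. Because the F-measure is the harmonic mean of precision and recall, the two reported operating points can be checked directly against the figures quoted in the abstract.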

