Predicting Co-Complexed Protein Pairs from Heterogeneous Data

Open Access

18 April 2008

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Computational Biology

Vol. 4 (4), e1000054
https://doi.org/10.1371/journal.pcbi.1000054

Abstract

Proteins do not carry out their functions alone. Instead, they often act by participating in macromolecular complexes and play different functional roles depending on the other members of the complex. It is therefore interesting to identify co-complex relationships. Although protein complexes can be identified in a high-throughput manner by experimental technologies such as affinity purification coupled with mass spectrometry (APMS), these large-scale datasets often suffer from high false positive and false negative rates. Here, we present a computational method that predicts co-complexed protein pair (CCPP) relationships using kernel methods from heterogeneous data sources. We show that a diffusion kernel based on random walks on the full network topology yields good performance in predicting CCPPs from protein interaction networks. In the setting of direct ranking, a diffusion kernel performs much better than the mutual clustering coefficient. In the setting of SVM classifiers, a diffusion kernel performs much better than a linear kernel. We also show that combination of complementary information improves the performance of our CCPP recognizer. A summation of three diffusion kernels based on two-hybrid, APMS, and genetic interaction networks and three sequence kernels achieves better performance than the sequence kernels or diffusion kernels alone. Inclusion of additional features achieves a still better ROC₅₀ of 0.937. Assuming a negative-to-positive ratio of 600∶1, the final classifier achieves 89.3% coverage at an estimated false discovery rate of 10%. Finally, we applied our prediction method to two recently described APMS datasets. We find that our predicted positives are highly enriched with CCPPs that are identified by both datasets, suggesting that our method successfully identifies true CCPPs. An SVM classifier trained from heterogeneous data sources provides accurate predictions of CCPPs in yeast. This computational method thereby provides an inexpensive method for identifying protein complexes that extends and complements high-throughput experimental data. Many proteins perform their jobs as part of multi-protein units called complexes, and several technologies exist to identify these complexes and their components with varying precision and throughput. In this work, we describe and apply a computational framework for combining a variety of experimental data to identify pairs of yeast proteins that partipicate in a complex—so-called co-complexed protein pairs (CCPPs). The method uses machine learning to generalize from well-characterized CCPPs, making predictions of novel CCPPs on the basis of sequence similarity, tandem affinity mass spectrometry data, yeast two-hybrid data, genetic interactions, microarray expression data, ChIP-chip assays, and colocalization by fluorescence microscopy. The resulting model accurately summarizes this heterogeneous body of data: in a cross-validated test, the model achieves an estimated coverage of 89% at a false discovery rate of 10%. The final collection of predicted CCPPs is available as a public resource. These predictions, as well as the general methodology described here, provide a valuable summary of diverse yeast interaction data and generate quantitative, testable hypotheses about novel CCPPs.

Keywords

This publication has 53 references indexed in Scilit:

Global landscape of protein complexes in the yeast Saccharomyces cerevisiae
Nature, 2006
Choosing negative examples for the prediction of protein-protein interactions
BMC Bioinformatics, 2006
Proteome survey reveals modularity of the yeast cell machinery
Nature, 2006
BioGRID: a general repository for interaction datasets
Nucleic Acids Research, 2006
Global analysis of protein localization in budding yeast
Nature, 2003
Transcriptional Regulatory Networks in Saccharomyces cerevisiae
Science, 2002
Functional organization of the yeast proteome by systematic analysis of protein complexes
Nature, 2002
Correlated sequence-signatures as markers of protein-protein interaction
Journal of Molecular Biology, 2001
The Transcriptional Program of Sporulation in Budding Yeast
Science, 1998
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research, 1997

Cited by 56 articles