Stacking models for nearly optimal link prediction in complex networks

4 September 2020

journal article
research article
Published in Proceedings of the National Academy of Sciences of the United States of America

Vol. 117 (38), 23393-23400
https://doi.org/10.1073/pnas.1914950117

Abstract

Most real-world networks are incompletely observed. Algorithms that can accurately predict which links are missing can dramatically speed up network data collection and improve network model validation. Many algorithms now exist for predicting missing links, given a partially observed network, but it has remained unknown whether a single best predictor exists, how link predictability varies across methods and networks from different domains, and how close to optimality current methods are. We answer these questions by systematically evaluating 203 individual link predictor algorithms, representing three popular families of methods, applied to a large corpus of 550 structurally diverse networks from six scientific domains. We first show that individual algorithms exhibit a broad diversity of prediction errors, such that no one predictor or family is best, or worst, across all realistic inputs. We then exploit this diversity using network-based metalearning to construct a series of “stacked” models that combine predictors into a single algorithm. Applied to a broad range of synthetic networks, for which we may analytically calculate optimal performance, these stacked models achieve optimal or nearly optimal levels of accuracy. Applied to real-world networks, stacked models are superior, but their accuracy varies strongly by domain, suggesting that link prediction may be fundamentally easier in social networks than in biological or technological networks. These results indicate that the state of the art for link prediction comes from combining individual algorithms, which can achieve nearly optimal predictions. We close with a brief discussion of limitations and opportunities for further improvements.
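
The stacking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-community graph, the choice of three topological predictors (common neighbors, Jaccard, Adamic-Adar), the logistic-regression meta-learner, the hold-out fractions, and all function names are assumptions made for illustration. The abstract does not specify the meta-learner or the 203 predictors. The sketch hides some links, trains a meta-learner on the individual predictors' scores, and compares its AUC with each predictor used alone.

```python
import math
import random
from itertools import combinations


def make_graph(n_per_block=12, p_in=0.5, p_out=0.05, seed=0):
    """Toy two-community random graph (illustrative data, not from the paper)."""
    rng = random.Random(seed)
    nodes = list(range(2 * n_per_block))
    edges = set()
    for u, v in combinations(nodes, 2):
        same_block = (u < n_per_block) == (v < n_per_block)
        if rng.random() < (p_in if same_block else p_out):
            edges.add((u, v))
    return nodes, edges


def adjacency(nodes, edges):
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def features(adj, u, v):
    """Scores from three individual predictors: common neighbors, Jaccard, Adamic-Adar."""
    common = adj[u] & adj[v]
    union = adj[u] | adj[v]
    jaccard = len(common) / len(union) if union else 0.0
    # A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
    adamic_adar = sum(1.0 / math.log(len(adj[w])) for w in common)
    return [float(len(common)), jaccard, adamic_adar]


def sigmoid(z):
    # Clamp the argument to avoid overflow in math.exp.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def train_logistic(X, y, lr=0.05, epochs=200):
    """Meta-learner stand-in: logistic regression fit by stochastic gradient descent."""
    w = [0.0] * (len(X[0]) + 1)  # last weight is the bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(w[-1] + sum(a * b for a, b in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                w[j] -= lr * err * xj
            w[-1] -= lr * err
    return w


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def stacked_link_prediction(seed=0):
    rng = random.Random(seed)
    nodes, edges = make_graph(seed=seed)
    edge_list = sorted(edges)
    rng.shuffle(edge_list)
    k = len(edge_list) // 5
    test_pos, train_pos = edge_list[:k], edge_list[k:2 * k]
    observed = edge_list[k:]         # graph with the test links hidden
    train_graph = edge_list[2 * k:]  # observed graph with training links also hidden
    # Simplification (assumption): negatives are true non-edges, split in half.
    non_edges = sorted(set(combinations(nodes, 2)) - edges)
    rng.shuffle(non_edges)
    train_neg, test_neg = non_edges[::2], non_edges[1::2]

    # Train the meta-learner on predictor scores computed from the training graph.
    adj_train = adjacency(nodes, train_graph)
    X = [features(adj_train, u, v) for u, v in train_pos + train_neg]
    y = [1] * len(train_pos) + [0] * len(train_neg)
    w = train_logistic(X, y)

    # Evaluate on the held-out test links, using scores from the observed graph.
    adj_obs = adjacency(nodes, observed)

    def stacked(u, v):
        return sigmoid(w[-1] + sum(a * b for a, b in zip(w, features(adj_obs, u, v))))

    results = {"stacked": auc([stacked(u, v) for u, v in test_pos],
                              [stacked(u, v) for u, v in test_neg])}
    for i, name in enumerate(["common_neighbors", "jaccard", "adamic_adar"]):
        results[name] = auc([features(adj_obs, u, v)[i] for u, v in test_pos],
                            [features(adj_obs, u, v)[i] for u, v in test_neg])
    return results


if __name__ == "__main__":
    for name, value in stacked_link_prediction().items():
        print(f"{name}: AUC = {value:.3f}")
```

On a graph this small the stacked model need not beat every individual predictor; the sketch only shows the mechanics of turning predictor scores into features for a supervised learner.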

Funding Information

National Science Foundation (IIS-1452718)
Army Research Office (W911NF-15-1-0259)


Cited by 79 articles