Inductive matrix completion for predicting gene–disease associations

Open Access

11 June 2014

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 30 (12), i60-i68
https://doi.org/10.1093/bioinformatics/btu269

Abstract

Motivation: Most existing methods for predicting causal disease genes rely on a specific type of evidence and are therefore limited in applicability. More often than not, the type of evidence available for diseases varies: for example, we may know linked genes, keywords associated with the disease obtained by mining text, or co-occurrence of disease symptoms in patients. Similarly, the type of evidence available for genes varies: for example, specific microarray probes convey information only for certain sets of genes. In this article, we apply a novel matrix-completion method called Inductive Matrix Completion to the problem of predicting gene–disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene–disease associations. We construct features from different biological sources such as microarray expression data and disease-related textual data. A crucial advantage of the method is that it is inductive: it can be applied to diseases not seen at training time, unlike traditional matrix-completion approaches and network-based inference methods, which are transductive.
Results: Comparison with state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database shows that the proposed approach is substantially better: it has close to a one-in-four chance of recovering a true association in the top 100 predictions, ahead of the recently proposed Catapult method (second best). The method is also effective for query diseases that have no previously known gene associations, and for predicting novel genes, i.e. genes not previously linked to diseases. Thus the method is capable of predicting novel genes even for well-characterized diseases. We also validate the novelty of predictions by evaluating the method on recently reported OMIM associations and on associations recently reported in the literature.
Availability: Source code and datasets can be downloaded from http://bigdata.ices.utexas.edu/project/gene-disease.
Contact: naga86@cs.utexas.edu
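
As a rough illustration of the idea summarized in the abstract (features for genes and diseases combined through learned latent factors), the sketch below assumes the commonly used bilinear form in which the association matrix is approximated as P ≈ X W Hᵀ Yᵀ, with X holding gene features and Y holding disease features. The abstract does not state this formula, so it is an assumption here, as are the squared-error loss over all entries, the plain gradient-descent solver, the synthetic data, and the names `fit_imc` and `score_genes`. It is not the authors' released implementation.

```python
import numpy as np


def fit_imc(P, X, Y, rank=2, lam=0.01, lr=0.01, steps=2000, seed=0):
    """Fit P ~ X @ W @ H.T @ Y.T by gradient descent (illustrative only)."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], rank))  # gene-feature factors
    H = 0.1 * rng.standard_normal((Y.shape[1], rank))  # disease-feature factors
    n = P.size
    for _ in range(steps):
        R = X @ W @ H.T @ Y.T - P  # residual over all entries
        # Gradients of mean squared error / 2 plus an L2 penalty on W and H.
        grad_W = X.T @ R @ (Y @ H) / n + lam * W
        grad_H = Y.T @ R.T @ (X @ W) / n + lam * H
        W -= lr * grad_W
        H -= lr * grad_H
    return W, H


def score_genes(X, y_new, W, H):
    """Score every gene against one disease described only by its features."""
    return X @ W @ (H.T @ y_new)


# Synthetic stand-in data (assumption: real features would come from
# sources such as expression data and disease-related text).
rng = np.random.default_rng(42)
X = rng.standard_normal((30, 5))   # 30 genes, 5 features each
Y = rng.standard_normal((20, 4))   # 20 training diseases, 4 features each
P = X @ rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4)) @ Y.T

W, H = fit_imc(P, X, Y)

# Inductive step: a disease absent from training is scored from features alone.
y_new = rng.standard_normal(4)
scores = score_genes(X, y_new, W, H)
top5 = np.argsort(scores)[::-1][:5]  # indices of the five highest-scoring genes
print(top5)
```

Because the learned factors act on feature vectors, not on row or column indices, a new disease needs only a feature vector to be ranked against all genes. A transductive factorization of P alone would have no latent representation for an unseen disease.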
