Inductive matrix completion for predicting gene–disease associations
Open Access
- 11 June 2014
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 30 (12), i60-i68
- https://doi.org/10.1093/bioinformatics/btu269
Abstract
Motivation: Most existing methods for predicting causal disease genes rely on a specific type of evidence, and are therefore limited in applicability. More often than not, the type of evidence available for diseases varies: for example, we may know linked genes, keywords associated with the disease obtained by mining text, or co-occurrence of disease symptoms in patients. Similarly, the type of evidence available for genes varies: for example, specific microarray probes convey information only for certain sets of genes. In this article, we apply a novel matrix-completion method called Inductive Matrix Completion to the problem of predicting gene–disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene–disease associations. We construct features from different biological sources such as microarray expression data and disease-related textual data. A crucial advantage of the method is that it is inductive; it can be applied to diseases not seen at training time, unlike traditional matrix-completion approaches and network-based inference methods, which are transductive.
Results: Comparison with state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database shows that the proposed approach is substantially better: it has close to a one-in-four chance of recovering a true association in the top 100 predictions, ahead of the recently proposed Catapult method (second best). We demonstrate that the inductive method is particularly effective for a query disease with no previously known gene associations, and for predicting novel genes, i.e. genes not previously linked to diseases. Thus the method is capable of predicting novel genes even for well-characterized diseases. We also validate the novelty of predictions by evaluating the method on recently reported OMIM associations and on associations recently reported in the literature.
Availability: Source code and datasets can be downloaded from http://bigdata.ices.utexas.edu/project/gene-disease.
Contact: naga86@cs.utexas.edu
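The inductive idea described in the abstract can be illustrated with a minimal sketch. It assumes a bilinear model in which the score for gene i and disease j is x_i^T W H^T y_j, where x_i and y_j are gene and disease feature vectors and W, H are learned latent-factor matrices. The function names (`fit_imc`, `score_new_disease`, `top_k_genes`), the squared loss on observed entries, the ridge penalty, the alternating least-squares solver, and the synthetic data are all assumptions made for illustration; they are not taken from the authors' released code and may differ from the procedure used in the paper.

```python
import numpy as np


def _ridge(F, t, lam):
    """Solve min_v ||F v - t||^2 + lam * ||v||^2 in closed form."""
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ t)


def fit_imc(P, mask, X, Y, rank=2, lam=1e-3, iters=20, seed=0):
    """Fit W, H so that X @ W @ H.T @ Y.T matches P where mask is True.

    P:    genes x diseases association matrix (hypothetical values)
    mask: boolean matrix, True for entries used in training
    X:    genes x gene-features, Y: diseases x disease-features
    """
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(mask)
    t = P[rows, cols]
    H = rng.standard_normal((Y.shape[1], rank))
    for _ in range(iters):
        # Fix H and solve a ridge regression for the entries of W.
        B = Y @ H
        F = np.einsum("na,nc->nac", X[rows], B[cols]).reshape(len(t), -1)
        W = _ridge(F, t, lam).reshape(X.shape[1], rank)
        # Fix W and solve a ridge regression for the entries of H.
        A = X @ W
        F = np.einsum("nb,nc->nbc", Y[cols], A[rows]).reshape(len(t), -1)
        H = _ridge(F, t, lam).reshape(Y.shape[1], rank)
    return W, H


def score_new_disease(X, W, H, y_new):
    """Score every gene for a disease known only through its features."""
    return X @ W @ (H.T @ y_new)


def top_k_genes(scores, k):
    """Indices of the k highest-scoring genes, best first."""
    return np.argsort(-scores)[:k]


# Synthetic demo: the last disease is hidden entirely during training.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))   # 30 genes, 5 gene features
Y = rng.standard_normal((20, 4))   # 20 diseases, 4 disease features
P = X @ rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4)) @ Y.T
mask = np.ones(P.shape, dtype=bool)
mask[:, -1] = False
W, H = fit_imc(P, mask, X, Y)
print(top_k_genes(score_new_disease(X, W, H, Y[-1]), 5))
```

Because the learned parameters act on features rather than on gene or disease identities, the held-out disease can be scored from its feature vector alone. A transductive factorization, by contrast, would have no latent factor for a disease whose column was never observed.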