Gene2vec: distributed representation of genes based on co-expression
Open Access
- 3 February 2019
- journal article
- conference paper
- Published by Springer Science and Business Media LLC in BMC Genomics
- Vol. 20 (S1), 7-15
- https://doi.org/10.1186/s12864-018-5370-x
Abstract
Background: Existing functional descriptions of genes are categorical, discrete, and mostly produced through manual processes. In this work, we explore the idea of gene embedding, a distributed representation of genes, in the spirit of word embedding.
Results: In a purely data-driven fashion, we trained a 200-dimensional vector representation of all human genes, using gene co-expression patterns in 984 data sets from the GEO databases. These vectors capture functional relatedness of genes in terms of recovering known pathways: the average inner product (similarity) of genes within a pathway is 1.52X greater than that of random genes. Using t-SNE, we produced a gene co-expression map that shows local concentrations of tissue-specific genes. We also illustrated the usefulness of the embedded gene vectors, laden with rich information on gene co-expression patterns, in tasks such as gene-gene interaction prediction.
Conclusions: We proposed a machine learning method that utilizes transcriptome-wide gene co-expression to generate a distributed representation of genes. We further demonstrated the utility of our distributed representation by predicting gene-gene interactions based solely on gene names. The distributed representation of genes could be useful for more bioinformatics applications.
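The pathway comparison reported in the results (average inner product within a pathway versus random genes) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy 2-dimensional vectors, the placeholder gene labels, and the choice to sample random gene sets of the same size as the pathway are all assumptions made for illustration. The published vectors are 200-dimensional and would be loaded from the trained model instead.

```python
import itertools
import random


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pairwise_inner(vectors):
    """Average inner product over all unordered pairs of vectors."""
    pairs = list(itertools.combinations(vectors, 2))
    return sum(inner(u, v) for u, v in pairs) / len(pairs)


def pathway_vs_random_ratio(embedding, pathway, n_random_sets=100, seed=0):
    """Ratio of within-pathway similarity to the similarity of random gene sets.

    Assumption: the baseline is the mean score of random gene sets with the
    same size as the pathway; the paper may define its baseline differently.
    """
    rng = random.Random(seed)
    genes = sorted(embedding)
    within = mean_pairwise_inner([embedding[g] for g in pathway])
    random_scores = []
    for _ in range(n_random_sets):
        sample = rng.sample(genes, len(pathway))
        random_scores.append(mean_pairwise_inner([embedding[g] for g in sample]))
    baseline = sum(random_scores) / len(random_scores)
    return within / baseline


# Toy data: hypothetical gene labels with made-up 2-dimensional vectors.
toy_embedding = {
    "GENE_A": [1.0, 1.0],
    "GENE_B": [1.0, 1.0],
    "GENE_C": [1.0, 1.0],
    "GENE_D": [1.0, 0.0],
    "GENE_E": [0.0, 1.0],
    "GENE_F": [1.0, 0.0],
    "GENE_G": [0.0, 1.0],
    "GENE_H": [0.5, 0.5],
}
toy_pathway = ["GENE_A", "GENE_B", "GENE_C"]

ratio = pathway_vs_random_ratio(toy_embedding, toy_pathway)
print(f"within-pathway / random similarity ratio: {ratio:.2f}")
```

A ratio above 1 indicates that genes in the pathway are, on average, more similar to each other than randomly chosen genes are. The abstract reports 1.52X for the real embedding; the toy ratio has no bearing on that figure.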