Transcription Factor Bound Regions Prediction: Word2Vec Technique with Convolutional Neural Network

Abstract
Genome-wide epigenomic datasets allow us to validate the biological function of motifs and understand the regulatory mechanisms more comprehensively. How different motifs determine whether transcription factors (TFs) can bind to DNA at a specific position is a critical research question. In this project, we apply computational techniques that were used in Natural Language Processing (NLP) to predict the Transcription Factor Bound Regions (TFBRs) given motif instances. Most existing motif prediction methods using deep neural network apply base sequences with one-hot encoding as an input feature to realize TFBRs identification, contributing to low-resolution and indirect binding mechanisms. However, how the collective effect of motifs on binding sites is complicated to figure out. In our pipeline, we apply Word2Vec algorithm, with names of motifs as an input to predict TFBRs utilizing Convolutional Neural Network (CNN) to realize binary classification, based on the ENCODE dataset. In this regard, we consider different types of motifs as separate “words”, and their corresponding TFBR as the meanings of “sentences”. One “sentence” itself is merely the combination of these motifs, and all “sentences” compose of the whole “passage”. For each binding site, we do the binary classification within different cell types to show the performance of our model in different binding sites and cell types. Each “word” has a corresponding vector in high dimensions, and the distances between each vector can be figured out, so we can extract the similarity between each motif, and the explicit binding mechanism from our model. We apply Convolutional Neural Network (CNN) to extract features in the process of mapping and pooling from motif vectors extracted by Word2Vec Algorithm and gain the result of 87% accuracy at the peak.

This publication has 1 reference indexed in Scilit: