D2VCB: A Hybrid Deep Neural Network for the Prediction of in-vivo Protein-DNA Binding from Combined DNA Sequence

Abstract

Prediction of in-vivo protein-DNA binding is an important but challenging task in the broad field of computational biology. Although some deep learning methods have succeeded in modeling in-vivo protein-DNA binding, they often extract features only from the original DNA sequence, without considering related sequences such as its reverse, complementary, and reverse complementary sequences. In addition, one-hot encoding of DNA sequences is vulnerable to the curse of dimensionality, which leads to unwanted equidistance between pairs of sequences. To address these problems, we propose D2VCB (dna2vec, convolution, bi-LSTM), a novel hybrid deep neural network framework that uses dna2vec to predict in-vivo protein-DNA binding events. We extract input features from the original DNA sequences and from their reverse, complementary, and reverse complementary sequences, and then use dna2vec to compute a distributed representation of k-mers. In the D2VCB model, the convolution layer captures motif features, while the recurrent layer captures long-term dependencies among motif features so as to improve prediction accuracy. Our comparison experiments show that D2VCB significantly outperforms existing methods on multiple performance metrics.
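The input construction described in the abstract can be illustrated with a minimal Python sketch. It derives the four sequence variants and splits each into overlapping k-mers, which are the tokens a dna2vec-style embedding would map to vectors. The function names, the choice of k = 3, and the stride of 1 are assumptions made for illustration only; the abstract does not state the k-mer lengths or tokenization details actually used in D2VCB, and the embedding lookup itself is not shown.

```python
# Illustrative sketch only: names, k, and stride are assumptions,
# not taken from the D2VCB paper.

# Translation table for Watson-Crick base pairing (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def sequence_variants(seq):
    """Return the original, reverse, complementary, and
    reverse complementary versions of a DNA sequence."""
    seq = seq.upper()
    comp = seq.translate(COMPLEMENT)
    return {
        "original": seq,
        "reverse": seq[::-1],
        "complement": comp,
        "reverse_complement": comp[::-1],
    }


def kmers(seq, k=3, stride=1):
    """Split a sequence into overlapping k-mers (hypothetical tokenization)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


variants = sequence_variants("ATGCGT")
# Each variant becomes a list of k-mer tokens that an embedding
# such as dna2vec could then convert into dense vectors.
tokens = {name: kmers(s, k=3) for name, s in variants.items()}

for name, toks in tokens.items():
    print(name, variants[name], toks)
```

In a full pipeline, each token list would be replaced by its embedding vectors before being passed to the convolution and bi-LSTM layers; how the four variants are combined at that stage is not specified in the abstract.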
