Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature

Open Access

12 November 2008

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 25 (1), 30-35
https://doi.org/10.1093/bioinformatics/btn583

Abstract

Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information which reflects the characteristics of 20 kinds of amino acids for two physical–chemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class.

Results: The results show that the RF model achieves 91.41% overall accuracy with a Matthews correlation coefficient of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the computationally optimal approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional (3D) structural information. We have demonstrated that the prediction results are useful for understanding protein–DNA interactions.

Availability: DBindR web server implementation is freely available at http://www.cbi.seu.edu.cn/DBindR/DBindR.htm.

Contact: xsun@seu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.
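The abstract mentions two steps that are easy to illustrate: downsizing the majority class and scoring predictions with the Matthews correlation coefficient. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. It uses plain random undersampling, which may differ from the novel scheme proposed in the paper. The function names, the `ratio` and `seed` parameters, and the convention that binding residues are labelled 1 (minority) and non-binding residues 0 (majority) are all hypothetical.

```python
import math
import random


def downsample_majority(labels, ratio=1.0, seed=0):
    """Return sorted indices that keep every minority sample (label 1)
    and a random subset of majority samples (label 0).

    Assumption: simple random undersampling, keeping about
    ratio * (number of minority samples) majority samples.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(neg), int(round(ratio * len(pos))))
    return sorted(pos + rng.sample(neg, keep))


def matthews_cc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In this sketch, the indices returned by `downsample_majority` would select the training rows passed to a classifier, while `matthews_cc` would be computed on held-out predictions, since accuracy alone can be misleading on unbalanced data.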
