Robust ensemble of handcrafted and learned approaches for DNA-binding proteins

Open Access

4 May 2021

journal article
research article
Published by Emerald in Applied Computing and Informatics

Vol. ahead-of-p (ahead-of-p)
https://doi.org/10.1108/aci-03-2021-0051

Abstract

Purpose: Automatic DNA-binding protein (DNA-BP) classification is now an essential proteomic technology. Unfortunately, many systems reported in the literature are tested on only one or two datasets/tasks. The purpose of this study is to create an optimal, universal system for DNA-BP classification, one that performs competitively across several DNA-BP classification tasks.

Design/methodology/approach: Efficient DNA-BP classifier systems require the discovery of powerful protein representations and feature extraction methods. Experiments were performed that combined and compared descriptors extracted from state-of-the-art matrix/image protein representations. Separate support vector machines (SVMs) were trained on these descriptors and evaluated. Convolutional neural networks (CNNs) with different parameter settings were fine-tuned on two matrix representations of proteins. The CNN decisions were fused with those of the SVMs using the weighted sum rule, and the resulting ensembles were evaluated to experimentally derive the most powerful general-purpose DNA-BP classifier system.

Findings: In a broad and fair comparison with the literature, the best ensemble proposed here produced comparable, if not superior, classification results across four different datasets representing a variety of DNA-BP classification tasks, thereby demonstrating both the power and generalizability of the proposed system.

Originality/value: Most DNA-BP methods proposed in the literature are validated on only one (rarely two) datasets/tasks. In this work, the authors report the performance of their general-purpose DNA-BP system on four datasets representing different DNA-BP classification tasks. The results of the best proposed classifier system demonstrate the power of the approach and can now be used for baseline comparisons by other researchers in the field.
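The weighted sum rule named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the weights, the score normalization, or the decision threshold, so the z-score normalization, the example weights, the zero threshold, and the function names `normalize_scores` and `weighted_sum_fusion` are all assumptions made for illustration, and the scores are made-up numbers rather than results from the paper.

```python
from statistics import mean, pstdev


def normalize_scores(scores):
    """Standardize one classifier's scores to zero mean and unit variance.

    Assumption: z-score normalization is used so that SVM margins and CNN
    outputs are on a comparable scale before they are summed.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # A constant score vector carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def weighted_sum_fusion(score_sets, weights):
    """Fuse per-classifier scores with the weighted sum rule.

    score_sets: one list of scores per classifier, all over the same samples.
    weights:    one non-negative weight per classifier (hypothetical values).
    Returns one fused score per sample.
    """
    if len(score_sets) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    n_samples = len(score_sets[0])
    if any(len(s) != n_samples for s in score_sets):
        raise ValueError("all classifiers must score the same samples")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    normalized = [normalize_scores(s) for s in score_sets]
    return [
        sum(w * scores[i] for w, scores in zip(weights, normalized)) / total
        for i in range(n_samples)
    ]


# Illustrative scores for four proteins (invented values).
svm_scores = [1.2, -0.4, 0.3, -1.5]
cnn_scores = [0.9, 0.2, 0.6, 0.1]

# Hypothetical weights: the CNN counts twice as much as the SVM.
fused = weighted_sum_fusion([svm_scores, cnn_scores], weights=[1.0, 2.0])

# Assumed decision rule: a positive fused score means "DNA-binding".
predicted_binding = [score > 0 for score in fused]
```

In practice the weights would be chosen on validation data rather than fixed by hand, and any number of SVM or CNN score lists can be passed in the same call.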

