Robust ensemble of handcrafted and learned approaches for DNA-binding proteins
Open Access
- 4 May 2021
- journal article
- research article
- Published by Emerald in Applied Computing and Informatics
- Vol. ahead-of-p (ahead-of-p)
- https://doi.org/10.1108/aci-03-2021-0051
Abstract
Purpose: Automatic DNA-binding protein (DNA-BP) classification is now an essential proteomic technology. Unfortunately, many systems reported in the literature are tested on only one or two datasets/tasks. The purpose of this study is to create an optimal, universal system for DNA-BP classification, one that performs competitively across several DNA-BP classification tasks.
Design/methodology/approach: Efficient DNA-BP classifier systems require the discovery of powerful protein representations and feature extraction methods. Experiments were performed that combined and compared descriptors extracted from state-of-the-art matrix/image protein representations. Separate support vector machines (SVMs) were trained on these descriptors and evaluated. Convolutional neural networks (CNNs) with different parameter settings were fine-tuned on two matrix representations of proteins. The CNN decisions were fused with those of the SVMs using the weighted sum rule, and the resulting ensembles were evaluated to derive experimentally the most powerful general-purpose DNA-BP classifier system.
Findings: The best ensemble proposed here produced comparable, if not superior, classification results in a broad and fair comparison with the literature across four different datasets representing a variety of DNA-BP classification tasks, thereby demonstrating both the power and generalizability of the proposed system.
Originality/value: Most DNA-BP methods proposed in the literature are validated on only one (rarely two) dataset or task. In this work, the authors report the performance of their general-purpose DNA-BP system on four datasets representing different DNA-BP classification tasks. The excellent results of the best proposed classifier system demonstrate the power of the approach. These results can now be used for baseline comparisons by other researchers in the field.
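The weighted sum rule mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the z-score normalization, the example weights, the zero decision threshold, and the names `zscore` and `weighted_sum_fusion` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def zscore(scores):
    """Standardize one classifier's scores to zero mean and unit variance.

    Assumption: scores are normalized before fusion so that classifiers
    with different output ranges contribute on a comparable scale.
    """
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        # A constant score vector carries no ranking information.
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std


def weighted_sum_fusion(score_lists, weights):
    """Fuse per-classifier scores with the weighted sum rule.

    score_lists: one sequence of scores per classifier, all of equal length.
    weights:     one non-negative weight per classifier (hypothetical values).
    Returns one fused score per sample.
    """
    if len(score_lists) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    # Shape: (n_classifiers, n_samples)
    normalized = np.stack([zscore(s) for s in score_lists])
    w = np.asarray(weights, dtype=float)
    # Weighted sum over the classifier axis.
    return w @ normalized


# Made-up scores for four proteins from one SVM and one CNN.
svm_scores = [0.9, -0.4, 0.2, -1.1]
cnn_scores = [0.8, 0.3, 0.6, 0.1]

fused = weighted_sum_fusion([svm_scores, cnn_scores], weights=[1.0, 2.0])
# Assumed decision rule: a positive fused score means "DNA-binding".
predicted_binding = fused > 0
```

In practice the weights would be chosen on validation data rather than fixed by hand, and the same function extends to any number of SVMs and CNNs by adding entries to `score_lists` and `weights`.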