Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics
Open Access
- 1 January 2014
- journal article
- research article
- Published by Hindawi Limited in The Scientific World Journal
- Vol. 2014, 1-12
- https://doi.org/10.1155/2014/173869
Abstract
Bioinformatics has been an emerging area of research for the last three decades. Its ultimate aims are to store and manage biological data and to develop and analyze computational tools that enhance the understanding of those data. The volume of data accumulated by sequencing projects is increasing exponentially, which presents difficulties for experimental methods. To reduce the gap between newly sequenced proteins and proteins with known functions, many computational techniques involving classification and clustering algorithms have been proposed. Classifying protein sequences into existing superfamilies helps predict the structure and function of the large number of newly discovered proteins. Existing classification results are unsatisfactory because the feature encoding methods produce very large feature sets. In this work, a statistical metric-based feature selection technique is proposed to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth.
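The pipeline outlined in the abstract (encode each sequence as a feature vector, then keep only the features a statistical metric ranks highest) can be illustrated with a minimal sketch. The abstract names neither the encoding nor the metric, so everything here is an assumption made for illustration: the 2-gram frequency encoding, the function names `ngram_features`, `separation_score`, and `select_features`, and the score itself (absolute difference of class means divided by the sum of class standard deviations). It is not the authors' method.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_features(sequence, n=2):
    """Encode a sequence as normalized n-gram frequencies (assumed encoding)."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = max(sum(counts.values()), 1)  # avoid division by zero on short input
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=n))
    return {gram: counts.get(gram, 0) / total for gram in vocabulary}


def separation_score(pos_values, neg_values):
    """Hypothetical metric: |difference of means| / (sum of standard deviations)."""
    diff = abs(mean(pos_values) - mean(neg_values))
    spread = pstdev(pos_values) + pstdev(neg_values)
    if spread == 0:
        # No within-class variation: perfectly separating or uninformative.
        return float("inf") if diff > 0 else 0.0
    return diff / spread


def select_features(vectors, labels, k=10):
    """Return the k feature names that best separate class 1 from class 0."""
    pos = [v for v, y in zip(vectors, labels) if y == 1]
    neg = [v for v, y in zip(vectors, labels) if y == 0]
    scores = {
        name: separation_score([v[name] for v in pos], [v[name] for v in neg])
        for name in vectors[0]
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With 2-grams the full vector has 400 entries per sequence; calling `select_features(vectors, labels, k=10)` on vectors built by `ngram_features` would pass only the ten highest-scoring 2-grams to a downstream classifier. Both classes must contain at least one sequence for the scores to be defined.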
Funding Information
- Universiti Teknologi Petronas