Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

Open Access

1 January 2014

journal article
research article
Published by Hindawi Limited in The Scientific World Journal

Vol. 2014, 1-12
https://doi.org/10.1155/2014/173869

Abstract

Bioinformatics has been an emerging area of research for the last three decades. Its ultimate aims are to store and manage biological data and to develop and analyze computational tools that enhance the understanding of those data. The volume of data accumulated by sequencing projects is increasing exponentially, which presents difficulties for experimental methods. To reduce the gap between newly sequenced proteins and proteins with known functions, many computational techniques involving classification and clustering algorithms have been proposed. Classifying protein sequences into existing superfamilies helps predict the structure and function of the large number of newly discovered proteins. Existing classification results are unsatisfactory because the feature encoding methods in use produce very large feature sets. In this work, a statistical metric-based feature selection technique is proposed to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance measure metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth.
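The performance measures named in the abstract can be illustrated with a minimal sketch for a single superfamily treated as the positive class. This is not the paper's evaluation code: the function names, the `globin` and `other` labels, and the toy predictions are all hypothetical, and the formulas are the standard confusion-matrix definitions. Under those definitions, sensitivity and recall are the same quantity.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # positives predicted as positive
    fp: int  # negatives predicted as positive
    tn: int  # negatives predicted as negative
    fn: int  # positives predicted as negative


def confusion_counts(y_true, y_pred, positive):
    """Count outcomes for one class (hypothetical helper, one-vs-rest)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(c):
    """Standard definitions; sensitivity is reported again as recall."""
    accuracy = safe_div(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    sensitivity = safe_div(c.tp, c.tp + c.fn)
    specificity = safe_div(c.tn, c.tn + c.fp)
    precision = safe_div(c.tp, c.tp + c.fp)
    f_measure = safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "recall": sensitivity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Toy labels, assumed for illustration only (not data from the paper).
y_true = ["globin"] * 4 + ["other"] * 4
y_pred = ["globin", "globin", "globin", "other",
          "other", "other", "globin", "globin"]

counts = confusion_counts(y_true, y_pred, positive="globin")
for name, value in classification_metrics(counts).items():
    print(f"{name}: {value:.3f}")
```

For a multi-superfamily problem, one common choice is to compute these per class in this one-vs-rest fashion and then average them. Whether the paper does so is not stated in the abstract.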

Funding Information

Universiti Teknologi Petronas
