A Novel Hybrid Dimension Reduction Technique for Undersized High Dimensional Gene Expression Data Sets Using Information Complexity Criterion for Cancer Classification

Open Access

1 January 2015

journal article
research article
Published by Hindawi Limited in Computational and Mathematical Methods in Medicine

Vol. 2015, 1-14
https://doi.org/10.1155/2015/370640

Abstract

Gene expression data are typically large, complex, and highly noisy. Their dimension is high, with several thousand genes (i.e., features) but only a limited number of observations (i.e., samples). Although classical principal component analysis (PCA) is widely used as a standard first step in dimension reduction and in supervised and unsupervised classification, it suffers from several shortcomings for data sets with undersized samples, because the sample covariance matrix degenerates and becomes singular. In this paper we address these limitations within the context of probabilistic PCA (PPCA) by introducing and developing a novel approach that uses the maximum entropy covariance matrix and its hybridized smoothed covariance estimators. To reduce the dimensionality of the data and to choose the number of probabilistic PCs (PPCs) to be retained, we further introduce and develop the celebrated Akaike’s information criterion (AIC), the consistent Akaike’s information criterion (CAIC), and the information theoretic measure of complexity (ICOMP) criterion of Bozdogan. Six publicly available undersized benchmark data sets were analyzed to show the utility, flexibility, and versatility of our approach: the hybridized smoothed covariance matrix estimators do not degenerate, so they can be used to perform PPCA, reduce the dimension, and carry out supervised classification of cancer groups in high dimensions.
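
The workflow outlined in the abstract can be illustrated with a minimal sketch on synthetic data. It is not the authors' implementation. It assumes a generic convex shrinkage toward a scaled identity as a stand-in for the maximum entropy and hybridized smoothed estimators, the standard maximized PPCA log-likelihood (Tipping and Bishop form), and a conventional parameter count that may differ from the one used in the paper. ICOMP is omitted because its complexity penalty is not specified in the abstract. All function names, the shrinkage weight `alpha`, and the simulated data are hypothetical.

```python
import numpy as np

def shrink_covariance(S, alpha=0.1):
    """Convex shrinkage toward a scaled identity.

    Assumption: a generic stand-in, NOT the paper's maximum entropy or
    hybridized smoothed covariance estimators.
    """
    p = S.shape[0]
    return (1.0 - alpha) * S + alpha * (np.trace(S) / p) * np.eye(p)

def ppca_loglik(eigvals, q, n):
    """Maximized PPCA log-likelihood for q retained components.

    eigvals must be sorted in decreasing order and strictly positive.
    """
    p = eigvals.size
    sigma2 = eigvals[q:].mean()  # noise variance: mean of discarded eigenvalues
    return -0.5 * n * (p * np.log(2.0 * np.pi)
                       + np.log(eigvals[:q]).sum()
                       + (p - q) * np.log(sigma2)
                       + p)

def select_num_ppcs(S, n, q_max):
    """Score q = 1..q_max with AIC and CAIC and return the minimizers."""
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
    p = eigvals.size
    scores = {}
    for q in range(1, q_max + 1):
        # Assumed parameter count: loadings (up to rotation) + noise variance + mean.
        m = p * q - q * (q - 1) // 2 + 1 + p
        ll = ppca_loglik(eigvals, q, n)
        scores[q] = {"AIC": -2.0 * ll + 2.0 * m,
                     "CAIC": -2.0 * ll + m * (np.log(n) + 1.0)}
    best = {c: min(scores, key=lambda q: scores[q][c]) for c in ("AIC", "CAIC")}
    return best, scores

# Synthetic undersized data: n samples, p >> n features (hypothetical values).
rng = np.random.default_rng(0)
n, p, q_true = 30, 200, 3
Z = rng.normal(size=(n, q_true))
W = 3.0 * rng.normal(size=(q_true, p))
X = Z @ W + rng.normal(size=(n, p))

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                    # sample covariance (MLE scaling)
rank_S = np.linalg.matrix_rank(S)    # at most n - 1, so S is singular when n < p

S_smooth = shrink_covariance(S, alpha=0.1)
min_eig = np.linalg.eigvalsh(S_smooth).min()

best, scores = select_num_ppcs(S_smooth, n, q_max=10)
print("rank of S:", rank_S, "of", p)
print("smallest eigenvalue after shrinkage:", min_eig)
print("selected number of PPCs:", best)
```

The sample covariance has rank below `p`, which is the degeneracy the abstract refers to, whereas the shrunk matrix has strictly positive eigenvalues, so the log-likelihood and both criteria are well defined for every candidate `q`. The selected number of components depends on the shrinkage weight and on the parameter count assumed here.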

Funding Information

Council of Higher Education of Turkey


Cited by 22 articles