Optimality Driven Nearest Centroid Classification from Genomic Data
Open Access
- 3 October 2007
- journal article
- research article
- Published by Public Library of Science (PLoS) in PLOS ONE
- Vol. 2 (10), e1002
- https://doi.org/10.1371/journal.pone.0001002
Abstract
Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest-centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest-centroid classifiers.
This publication has 18 references indexed in Scilit:
- Eigengene-based linear discriminant model for tumor classification using gene expression microarray data. Bioinformatics, 2006
- Regularized linear discriminant analysis and its application in microarrays. Biostatistics, 2006
- Classification of microarrays to nearest centroids. Bioinformatics, 2005
- A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology, 2005
- Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 2004
- Least angle regression. The Annals of Statistics, 2004
- Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 2002
- Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 2000
- Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 2000
- Variable selection techniques in discriminant analysis: I. Description. British Journal of Mathematical and Statistical Psychology, 1982
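To make the abstract's ingredients concrete, here is a minimal sketch of a nearest-centroid classifier with greedy forward feature selection and centroids shrunk toward the overall mean. It is not the authors' algorithm: the abstract does not state their optimality criterion or their shrinkage estimator, so this sketch assumes training accuracy as a stand-in selection score and a fixed, hypothetical `shrink` weight. All function names and the toy data are illustrative.

```python
from statistics import fmean


def centroids(X, y, features, shrink=0.0):
    """Per-class means over `features`, pulled toward the overall mean.

    `shrink` is a hypothetical fixed weight in [0, 1]; 0 gives the plain
    (maximum likelihood) class means.
    """
    overall = [fmean(row[j] for row in X) for j in features]
    result = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        means = [fmean(r[j] for r in rows) for j in features]
        result[label] = [(1 - shrink) * m + shrink * o
                         for m, o in zip(means, overall)]
    return result


def predict(row, cents, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((row[j] - c) ** 2 for j, c in zip(features, centroid))
    return min(cents, key=lambda label: dist(cents[label]))


def accuracy(X, y, features, shrink=0.0):
    """Training accuracy of the nearest-centroid rule on the given features."""
    cents = centroids(X, y, features, shrink)
    hits = sum(predict(r, cents, features) == lab for r, lab in zip(X, y))
    return hits / len(y)


def greedy_select(X, y, k, shrink=0.0):
    """Forward selection: add, one at a time, the feature that gives the
    best score for the subset as a whole (assumed score: training accuracy)."""
    chosen, remaining = [], list(range(len(X[0])))
    while len(chosen) < k and remaining:
        best = max(remaining,
                   key=lambda j: accuracy(X, y, chosen + [j], shrink))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy data: only the middle feature separates the two classes.
X = [[0.0, 1.0, 5.0], [1.0, 1.2, 6.0], [0.0, 3.0, 6.0], [1.0, 3.1, 5.0]]
y = ["a", "a", "b", "b"]
selected = greedy_select(X, y, k=1)
label = predict([0.5, 2.9, 5.5], centroids(X, y, selected, shrink=0.5), selected)
```

The point of scoring each candidate together with the features already chosen is that the subset is judged as a whole, in contrast to ranking features by univariate scores. In practice the score would be computed on held-out data or replaced by a model-based criterion, since training accuracy is optimistic.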