Optimality Driven Nearest Centroid Classification from Genomic Data

Open Access

3 October 2007

journal article
research article
Published by Public Library of Science (PLoS) in PLOS ONE

Vol. 2 (10), e1002
https://doi.org/10.1371/journal.pone.0001002

Abstract

Nearest-centroid classifiers have recently been employed successfully in high-dimensional applications such as genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing a univariate score for each feature individually, without considering how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest-centroid classifiers that is instead based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that it can outperform existing nearest-centroid classifiers.
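To make the contrast in the abstract concrete, the following is a minimal Python sketch of a nearest-centroid classifier with two generic feature selection strategies: a univariate ranking that scores each feature on its own, and a greedy forward search that evaluates each candidate subset as a whole. It is an illustration under stated assumptions, not the algorithm proposed in the article. The function names, the `shrink` parameter (a simple linear pull of each class mean toward the overall mean), the training-accuracy criterion, and the toy data are all hypothetical choices made for this example.

```python
from statistics import mean

def class_centroids(X, y, features, shrink=0.0):
    """Per-class mean over the chosen features.

    shrink=0.0 gives the plain class mean; shrink=1.0 collapses every
    centroid onto the overall mean (illustrative linear shrinkage only).
    """
    overall = {j: mean(row[j] for row in X) for j in features}
    cents = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        cents[label] = {
            j: (1 - shrink) * mean(r[j] for r in rows) + shrink * overall[j]
            for j in features
        }
    return cents

def predict(x, cents):
    """Assign x to the class with the nearest centroid (squared Euclidean)."""
    def dist(label):
        return sum((x[j] - c) ** 2 for j, c in cents[label].items())
    return min(cents, key=dist)

def accuracy(X, y, features, shrink=0.0):
    """Training accuracy of the classifier restricted to `features`."""
    cents = class_centroids(X, y, features, shrink)
    return mean(1.0 if predict(row, cents) == lab else 0.0
                for row, lab in zip(X, y))

def univariate_ranking(X, y):
    """Rank features individually by the spread of their class means."""
    p = len(X[0])
    cents = class_centroids(X, y, range(p))
    def score(j):
        vals = [c[j] for c in cents.values()]
        m = mean(vals)
        return sum((v - m) ** 2 for v in vals)
    return sorted(range(p), key=score, reverse=True)

def greedy_forward(X, y, k, shrink=0.0):
    """Grow a subset of k features, each step adding the feature that
    gives the best training accuracy for the subset as a whole."""
    chosen, remaining = [], list(range(len(X[0])))
    while len(chosen) < k and remaining:
        best = max(remaining,
                   key=lambda j: accuracy(X, y, chosen + [j], shrink))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data (made up): 4 samples, 3 features, 2 classes.
# Features 0 and 2 separate the classes; feature 1 is noise.
X = [[0.0, 5.0, 1.0], [0.2, 4.0, 0.9], [1.0, 4.2, 3.0], [1.2, 5.2, 3.1]]
y = ["a", "a", "b", "b"]

ranking = univariate_ranking(X, y)
subset = greedy_forward(X, y, k=2)
model = class_centroids(X, y, subset)
label = predict([0.1, 4.5, 1.0], model)
```

In practice the subset criterion would be estimated on held-out data rather than on the training samples, and the scores would normally be standardized by within-class variance; both are omitted here to keep the sketch short.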
