Feature selection in omics prediction problems using cat scores and false nondiscovery rate control
Open Access
- 1 March 2010
- journal article
- Published by Institute of Mathematical Statistics in The Annals of Applied Statistics
- Vol. 4 (1), 503-519
- https://doi.org/10.1214/09-aoas277
Abstract
We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James–Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package “sda” available from the R repository CRAN.
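The abstract names cat scores without defining them, so the following is a hedged sketch of how a correlation-adjusted t-score is commonly written. The notation is an assumption introduced here and does not appear on this page: K classes, n_k samples in class k out of n in total, class centroids \mu_k, a pooled centroid \mu_pool, a diagonal variance matrix V, and a correlation matrix P. See the paper itself for the authoritative definitions.

```latex
% Sketch only; all symbols are assumed notation, not quoted from this record.
% t_k          : vector of per-feature t-scores for class k versus the pooled centroid
% \tau_k^{adj} : correlation-adjusted t-scores (cat scores) for class k
% S_i          : summary score for feature i, accumulated over the K classes
\begin{align*}
t_k &= \left(\frac{1}{n_k} - \frac{1}{n}\right)^{-1/2} V^{-1/2}\,(\mu_k - \mu_{\text{pool}}) \\
\tau_k^{\text{adj}} &= P^{-1/2}\, t_k \\
S_i &= \sum_{k=1}^{K} \left(\tau_{i,k}^{\text{adj}}\right)^{2}
\end{align*}
```

Under this sketch, when the features are uncorrelated (P equal to the identity matrix) the cat scores reduce to ordinary t-scores, which is why the correlation adjustment only matters for correlated features.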