Feature selection in omics prediction problems using cat scores and false nondiscovery rate control

Open Access

1 March 2010

journal article
Published by Institute of Mathematical Statistics in The Annals of Applied Statistics

Vol. 4 (1), 503-519
https://doi.org/10.1214/09-aoas277

Abstract

We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted t-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James–Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package “sda” available from the R repository CRAN.
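To make the notion of a correlation-adjusted t-score concrete, the following is a minimal sketch, assuming only two features whose correlation r is known. It decorrelates a pair of t-scores by multiplying with the inverse square root of the correlation matrix P, i.e. cat = P^(-1/2) t. The function names `inv_sqrt_corr_2x2` and `cat_scores_2` are hypothetical and are not part of the "sda" package; in the framework described above the correlations would be James–Stein shrinkage estimates rather than known values.

```python
import math


def inv_sqrt_corr_2x2(r):
    """Return P^(-1/2) for the 2x2 correlation matrix P = [[1, r], [r, 1]].

    Illustrative helper (hypothetical name), restricted to two features.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    # P has eigenvalues 1 + r and 1 - r with eigenvectors (1, 1) and (1, -1),
    # so its inverse square root has a simple closed form.
    inv_root_plus = 1.0 / math.sqrt(1.0 + r)
    inv_root_minus = 1.0 / math.sqrt(1.0 - r)
    a = 0.5 * (inv_root_plus + inv_root_minus)  # diagonal entries
    b = 0.5 * (inv_root_plus - inv_root_minus)  # off-diagonal entries
    return [[a, b], [b, a]]


def cat_scores_2(t_scores, r):
    """Decorrelate two t-scores: cat = P^(-1/2) t (assumes r is known)."""
    w = inv_sqrt_corr_2x2(r)
    t1, t2 = t_scores
    return [w[0][0] * t1 + w[0][1] * t2,
            w[1][0] * t1 + w[1][1] * t2]


if __name__ == "__main__":
    # Uncorrelated features: cat scores coincide with the t-scores.
    print(cat_scores_2([3.0, 3.0], 0.0))
    # Strong positive correlation: both scores are shrunk, because the
    # two features carry largely redundant evidence.
    print(cat_scores_2([3.0, 3.0], 0.8))
```

With r = 0 the cat scores equal the ordinary t-scores. With r = 0.8 and two equal t-scores of 3, each cat score drops to about 2.24, which shows how redundant, positively correlated features are down-weighted before any thresholding is applied.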

