Sparse Partial Least Squares Classification for High Dimensional Data
- 3 January 2010
- journal article
- Published by Walter de Gruyter GmbH in Statistical Applications in Genetics and Molecular Biology
- Vol. 9 (1), Article 17
- https://doi.org/10.2202/1544-6115.1492
Abstract
Partial least squares (PLS) is a well-known dimension reduction method that has recently been adapted for high-dimensional classification problems in genome biology. We develop sparse versions of two recently proposed PLS-based classification methods using sparse partial least squares (SPLS). These sparse versions aim to achieve variable selection and dimension reduction simultaneously. We consider both binary and multicategory classification. We provide analytical and simulation-based insights into the variable selection properties of these approaches and benchmark them on well-known, publicly available datasets that involve tumor classification with high-dimensional gene expression data. We show that incorporating SPLS into a generalized linear model (GLM) framework provides higher sensitivity in variable selection for multicategory classification with unbalanced sample sizes between classes. As the sample size increases, the two-stage approach provides comparable sensitivity with better specificity in variable selection. In binary classification and in multicategory classification with balanced sample sizes, the two-stage approach provides variable selection and prediction accuracy comparable to those of the GLM version and is computationally more efficient.
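To make the idea of a two-stage approach concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single latent component, assumes that the sparse direction is obtained by soft-thresholding the covariance vector between predictors and a 0/1 class label at a fraction `eta` of its largest absolute entry, and swaps in a plain nearest-centroid rule on the latent score as the second-stage classifier. The function names, the value of `eta`, and the simulated data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def sparse_direction(X, y, eta=0.5):
    """One sparse direction vector (assumed form: soft-thresholded X^T y)."""
    Xc = X - X.mean(axis=0)          # center predictors
    yc = y - y.mean()                # center the 0/1 class label
    z = Xc.T @ yc                    # covariance-like score per variable
    thresh = eta * np.abs(z).max()   # threshold relative to the largest score
    w = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

def fit_two_stage(X, y, eta=0.5):
    """Stage 1: sparse latent score. Stage 2: class centroids of that score."""
    mu = X.mean(axis=0)
    w = sparse_direction(X, y, eta)
    t = (X - mu) @ w
    return {"mu": mu, "w": w, "c0": t[y == 0].mean(), "c1": t[y == 1].mean()}

def predict(model, X):
    """Assign each sample to the class whose latent-score centroid is closer."""
    t = (X - model["mu"]) @ model["w"]
    return (np.abs(t - model["c1"]) < np.abs(t - model["c0"])).astype(int)

# Hypothetical simulated data: 60 samples, 200 variables, 5 of them informative.
rng = np.random.default_rng(0)
n, p = 60, 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, :5] += 3.0 * y[:, None]

model = fit_two_stage(X, y, eta=0.5)
selected = np.flatnonzero(model["w"])             # variables with nonzero weight
train_accuracy = (predict(model, X) == y).mean()
print("selected variables:", selected)
print("training accuracy:", train_accuracy)
```

Variables with a zero entry in `w` drop out of the latent score entirely, which is how a single step can perform variable selection and dimension reduction together. Larger values of `eta` select fewer variables. A realistic analysis would use several components, tune `eta` and the number of components by cross-validation, and evaluate accuracy on held-out data rather than the training set.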