Probabilistic classifiers with high-dimensional data
Open Access
- 17 November 2010
- journal article
- research article
- Published by Oxford University Press (OUP) in Biostatistics
- Vol. 12 (3), 399-412
- https://doi.org/10.1093/biostatistics/kxq069
Abstract
For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small n, large p classification problems despite their importance in medical decision making. In this paper, we introduce 2 criteria for the assessment of probabilistic classifiers, well-calibratedness and refinement, and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed 2 extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated, or at least not “anticonservative”, using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set.
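As an illustration of the two criteria, the sketch below splits the Brier score of a binary probabilistic classifier into a calibration term and a refinement term. This is the classical forecast-evaluation decomposition, used here as an assumption; the abstract does not state the evaluation measures developed in the paper, and they may differ. The function name `brier_decomposition` and the rounding parameter `ndigits` are hypothetical.

```python
from collections import defaultdict


def brier_decomposition(probs, labels, ndigits=2):
    """Split the Brier score into (calibration, refinement) terms.

    probs  : predicted probabilities of class 1
    labels : observed outcomes, coded 0 or 1
    Predictions are rounded to `ndigits` so that equal forecasts
    fall into the same group (an illustrative choice).
    """
    # Group the observed outcomes by the (rounded) predicted probability.
    groups = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[round(p, ndigits)].append(y)

    n = len(labels)
    calibration = 0.0
    refinement = 0.0
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)  # observed frequency of class 1 in the group
        # Calibration: squared gap between the forecast and observed frequency.
        calibration += len(ys) * (p - freq) ** 2
        # Refinement: outcome variance remaining within the group.
        refinement += len(ys) * freq * (1 - freq)
    return calibration / n, refinement / n


# Forecasts of 0.2 and 0.8 that match the observed frequencies exactly.
probs = [0.2] * 5 + [0.8] * 5
labels = [0, 0, 0, 0, 1] + [1, 1, 1, 1, 0]
calibration, refinement = brier_decomposition(probs, labels)
print(calibration, refinement)
```

A calibration term near zero means the predicted probabilities agree with the observed class frequencies. For forecasts that are already rounded to `ndigits`, the two terms sum to the ordinary Brier score. To mirror the cross-validated evaluation mentioned in the abstract, `probs` would hold predictions made on held-out samples only.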