Probabilistic classifiers with high-dimensional data

Open Access

17 November 2010

journal article
research article
Published by Oxford University Press (OUP) in Biostatistics

Vol. 12 (3), 399-412
https://doi.org/10.1093/biostatistics/kxq069

Abstract

For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small-n, large-p classification problems despite their importance in medical decision making. In this paper, we introduce 2 criteria for the assessment of probabilistic classifiers, well-calibratedness and refinement, and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed 2 extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated, or at least not “anticonservative”, using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set.
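As a rough illustration of the two criteria, the sketch below scores cross-validated class probabilities with a binned calibration term and a refinement term, in the spirit of the standard partition of the Brier score. It is not the evaluation measure developed in the paper, which the abstract does not specify. The function names `out_of_fold_probs` and `calibration_refinement`, the `fit_predict` callback interface, and the choice of 10 equal-width bins are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def out_of_fold_probs(
    X: Sequence,
    y: Sequence[int],
    fit_predict: Callable[[Sequence, Sequence[int], Sequence], List[float]],
    k: int = 5,
) -> List[float]:
    """Return P(class 1) for every sample, each predicted by a model
    trained without that sample (simple interleaved k-fold split).

    `fit_predict(train_X, train_y, test_X)` is a hypothetical callback
    that trains any probabilistic classifier and returns one
    probability per test sample.
    """
    n = len(y)
    probs = [0.0] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        for i, p in zip(test_idx, preds):
            probs[i] = p
    return probs


def calibration_refinement(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> Tuple[float, float]:
    """Binned calibration and refinement terms for binary labels.

    calibration: weighted squared gap between the mean predicted
        probability and the observed class-1 frequency in each bin
        (0 means the binned predictions match observed frequencies).
    refinement: weighted within-bin outcome variance f * (1 - f)
        (smaller means the bins separate the classes more sharply).
    """
    n = len(labels)
    bins: Dict[int, List[Tuple[float, int]]] = {}
    for p, label in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins.setdefault(b, []).append((p, label))

    calibration = 0.0
    refinement = 0.0
    for members in bins.values():
        weight = len(members) / n
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(label for _, label in members) / len(members)
        calibration += weight * (mean_p - freq) ** 2
        refinement += weight * freq * (1.0 - freq)
    return calibration, refinement
```

In this sketch, a classifier that always predicts the training prevalence can be perfectly calibrated while being poorly refined, which is why the two terms are reported separately. With small samples the bin frequencies are noisy, so the number of bins is a judgment call.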

Cited by 23 articles