Sample size planning for developing classifiers using high-dimensional DNA microarray data

Open Access

13 April 2006

journal article
Published by Oxford University Press (OUP) in Biostatistics

Vol. 8 (1), 101-117
https://doi.org/10.1093/biostatistics/kxj036

Abstract

Many gene expression studies attempt to develop a predictor of pre-defined diagnostic or prognostic classes. If the classes are similar biologically, then the number of genes that are differentially expressed between the classes is likely to be small compared to the total number of genes measured. This motivates a two-step process for predictor development: a subset of differentially expressed genes is selected for use in the predictor, and the predictor is then constructed from these genes. Both of these steps introduce variability into the resulting classifier, so both must be incorporated in sample size estimation. We introduce a methodology for sample size determination for prediction in the context of high-dimensional data that captures variability in both steps of predictor development. The methodology is based on a parametric probability model, but permits sample size computations to be carried out in a practical manner without extensive requirements for preliminary data. We find that many prediction problems do not require a large training set of arrays for classifier development.
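The two-step process described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's parametric sample size method; it is a toy illustration, under assumed settings, of how gene selection followed by classifier construction behaves as the training set grows. All names (`simulate`, `select_genes`, `fit_centroids`, `predict`, `holdout_accuracy`), the nearest-centroid classifier, and the parameter values (200 genes, 10 informative, effect size 1.5) are assumptions chosen for illustration, not taken from the article.

```python
import random
import statistics


def simulate(n_per_class, n_genes, n_informative, effect, rng):
    """Simulate two classes; only the first n_informative genes differ in mean."""
    samples, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [
                rng.gauss(effect if (label == 1 and g < n_informative) else 0.0, 1.0)
                for g in range(n_genes)
            ]
            samples.append(row)
            labels.append(label)
    return samples, labels


def select_genes(samples, labels, k):
    """Step 1: rank genes by an absolute two-sample t-like statistic, keep the top k."""
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
        scores.append((abs(statistics.mean(b) - statistics.mean(a)) / se, g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def fit_centroids(samples, labels, genes):
    """Step 2: build a nearest-centroid classifier on the selected genes only."""
    centroids = {}
    for label in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == label]
        centroids[label] = [statistics.mean(r[g] for r in rows) for g in genes]
    return centroids


def predict(sample, genes, centroids):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((sample[g] - c) ** 2 for g, c in zip(genes, centroids[label]))
    return min(centroids, key=dist)


def holdout_accuracy(n_per_class, rng, n_genes=200, n_informative=10,
                     effect=1.5, k=10, n_test=200):
    """Train on n_per_class samples per class, then score on fresh simulated data."""
    train_x, train_y = simulate(n_per_class, n_genes, n_informative, effect, rng)
    genes = select_genes(train_x, train_y, k)
    centroids = fit_centroids(train_x, train_y, genes)
    test_x, test_y = simulate(n_test // 2, n_genes, n_informative, effect, rng)
    correct = sum(predict(x, genes, centroids) == y for x, y in zip(test_x, test_y))
    return correct / len(test_y)


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (3, 5, 10, 20):
        print(f"training samples per class = {n}: accuracy = {holdout_accuracy(n, rng):.3f}")
```

Because both the selected gene set and the fitted centroids depend on the training sample, repeating the run with different seeds gives a rough sense of the variability that each step contributes at a given training set size.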


