Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data

Open Access

15 February 2011

journal article
research article
Published by Oxford University Press (OUP) in Briefings in Bioinformatics

Vol. 12 (3), 203-214
https://doi.org/10.1093/bib/bbr001

Abstract

Developments in whole-genome biotechnology have stimulated statistical focus on prediction methods. We review methodology for classifying patients into survival risk groups and for using cross-validation to evaluate such classifications. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell’s concordance index. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have used separate test data sets for model evaluation, and cross-validation has sometimes been used to optimize tuning parameters. In many applications, however, the available data are too limited for effective division into training and test sets. Consequently, authors have often either reported re-substitution statistics or analyzed their data with binary classification methods in order to use familiar cross-validation. In this article we indicate how to use cross-validation to evaluate survival risk models; specifically, how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We also discuss how to evaluate the statistical significance of a survival risk model and whether high-dimensional genomic data add predictive accuracy to a model based on standard covariates alone.
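The idea of cross-validated risk groups can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes NumPy only and synthetic data. The function names (`harrell_c`, `kaplan_meier`, `cv_risk_groups`) are hypothetical, and the risk model (features ranked by their training-set concordance, then combined as a signed sum) is a simple stand-in for whatever survival model would be used in practice. The point it shows is that feature selection, model fitting and the risk-group cutoff are all recomputed inside each fold, so every patient is classified by a model that never saw that patient.

```python
import numpy as np


def harrell_c(time, event, risk):
    """Harrell's concordance index: share of comparable pairs in which the
    subject who failed earlier has the higher risk score (ties count 0.5)."""
    time, event, risk = map(np.asarray, (time, event, risk))
    # Pair (i, j) is comparable when i has an observed event before time j.
    comparable = (time[:, None] < time[None, :]) & (event[:, None] == 1)
    diff = risk[:, None] - risk[None, :]
    concordant = (diff > 0) + 0.5 * (diff == 0)
    return float((concordant * comparable).sum() / comparable.sum())


def kaplan_meier(time, event):
    """Return the distinct event times and the Kaplan-Meier estimate at each."""
    time, event = np.asarray(time), np.asarray(event)
    t = np.unique(time[event == 1])
    at_risk = np.array([(time >= u).sum() for u in t])
    deaths = np.array([((time == u) & (event == 1)).sum() for u in t])
    return t, np.cumprod(1.0 - deaths / at_risk)


def cv_risk_groups(X, time, event, k=5, n_features=10, seed=0):
    """Assign each subject to a low (0) or high (1) risk group using a model
    built without that subject's fold. The model here is a stand-in."""
    X, time, event = np.asarray(X, float), np.asarray(time), np.asarray(event)
    n = X.shape[0]
    folds = np.random.default_rng(seed).permutation(n) % k  # balanced fold labels
    group = np.empty(n, dtype=int)
    score = np.empty(n)
    for f in range(k):
        train, test = folds != f, folds == f
        mu, sd = X[train].mean(0), X[train].std(0) + 1e-12
        z_train, z_test = (X[train] - mu) / sd, (X[test] - mu) / sd
        # Feature selection is repeated in every fold, on training data only.
        c = np.array([harrell_c(time[train], event[train], z) for z in z_train.T])
        top = np.argsort(-np.abs(c - 0.5))[:n_features]
        w = np.sign(c[top] - 0.5)
        cutoff = np.median(z_train[:, top] @ w)  # cutoff also from training data
        score[test] = z_test[:, top] @ w
        group[test] = (score[test] > cutoff).astype(int)
    return group, score


# Synthetic example: 120 subjects, 200 features, two of which carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 200))
true_time = rng.exponential(scale=np.exp(-(X[:, 0] + X[:, 1])))
censor_time = rng.exponential(scale=2.0, size=120)
time = np.minimum(true_time, censor_time)
event = (true_time <= censor_time).astype(int)

group, score = cv_risk_groups(X, time, event)
km_low = kaplan_meier(time[group == 0], event[group == 0])
km_high = kaplan_meier(time[group == 1], event[group == 1])
cv_c = harrell_c(time, event, score)
```

The two Kaplan-Meier curves here are cross-validated in the sense of the abstract: they are computed from the pooled held-out group assignments. Note that `score` pools values from a different model in each fold, so `cv_c` is only a rough summary under this sketch's assumptions.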

Cited by 167 articles