External Validation and Prediction Employing the Predictive Squared Correlation Coefficient — Test Set Activity Mean vs Training Set Activity Mean

28 October 2008

journal article
research article
Published by American Chemical Society (ACS) in Journal of Chemical Information and Modeling

Vol. 48 (11), 2140-2145
https://doi.org/10.1021/ci800253u

Abstract

The external prediction capability of quantitative structure−activity relationship (QSAR) models is often quantified using the predictive squared correlation coefficient, q². This index relates the predictive residual sum of squares, PRESS, to the activity sum of squares, SS, without postprocessing of the model output; such postprocessing is done automatically when calculating the conventional squared correlation coefficient, r². According to the current OECD guidelines, q² for external validation should be calculated with SS referring to the training set activity mean. Our present findings, including a mathematical proof, demonstrate that this approach yields a systematic overestimation of the prediction capability that is triggered by the difference between the training and test set activity means. Example calculations with three regression models and data sets taken from the literature further show that for external test sets, q² based on the training set activity mean may become even larger than r². As a consequence, we recommend always using the test set activity mean when quantifying the external prediction capability through q² and revising the respective OECD guidance document accordingly. The discussion includes a comparison between r² and q² value ranges and the q² statistics for cross-validation.
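
The effect described in the abstract can be illustrated with a minimal sketch, assuming the usual definition q² = 1 − PRESS/SS, with PRESS the sum of squared prediction errors over the test set and SS the sum of squared deviations of the test set activities from a reference mean. The function name `q2_external` and all activity values below are hypothetical and invented for illustration; they are not taken from the paper. For n test compounds, SS around the training set mean equals SS around the test set mean plus n times the squared difference of the two means, so the training-mean variant can never give the smaller q².

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def q2_external(y_test, y_pred, reference_mean):
    """Return q^2 = 1 - PRESS / SS for an external test set.

    PRESS is the sum of squared prediction errors on the test set;
    SS is the sum of squared deviations of the observed test set
    activities from reference_mean (training or test set mean).
    """
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss = sum((y - reference_mean) ** 2 for y in y_test)
    return 1.0 - press / ss


# Hypothetical activities (illustration only, not data from the paper).
y_train = [1.0, 2.0, 3.0, 4.0]   # training set mean: 2.5
y_test = [4.0, 5.0, 6.0]         # test set mean: 5.0
y_pred = [4.5, 5.0, 5.5]         # model predictions for the test set

q2_train_mean = q2_external(y_test, y_pred, mean(y_train))
q2_test_mean = q2_external(y_test, y_pred, mean(y_test))

print(f"q2 with training set mean: {q2_train_mean:.4f}")
print(f"q2 with test set mean:     {q2_test_mean:.4f}")
```

In this toy example the predictions are identical in both calls; only the reference mean changes. The larger the gap between the training and test set activity means, the more the training-mean variant inflates q².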

Cited by 453 articles