Boosting: An Ensemble Learning Tool for Compound Classification and QSAR Modeling

Abstract
A classification and regression tool, J. H. Friedman's Stochastic Gradient Boosting (SGB), is applied to predicting a compound's quantitative or categorical biological activity from a quantitative description of the compound's molecular structure. Stochastic Gradient Boosting is a procedure for building a sequence of models, for instance regression trees (as in this paper), whose outputs are combined to form the prediction: either an estimate of the biological activity or a class label to which a molecule belongs. In particular, the SGB procedure builds a model in a stage-wise manner by fitting each tree to the gradient of a loss function, e.g., squared error for regression and binomial log-likelihood for classification. The gradient is computed for every sample in the training set, but only a random subsample of these gradients is used at each stage. (Friedman showed that the well-known boosting algorithm AdaBoost of Freund and Schapire can be considered a special case of SGB.) The SGB method is used to analyze 10 cheminformatics data sets, most of which are publicly available. The results show that SGB's performance is comparable to that of Random Forest, another ensemble learning method, and generally competitive with or superior to that of other QSAR methods. The use of SGB's variable importance together with partial dependence plots for model interpretation is also illustrated.
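The stage-wise procedure described above (compute the negative gradient of the loss for all training samples, fit a small tree to a random subsample of those gradients, and add the tree's output to the ensemble with a shrinkage factor) can be sketched as follows for the squared-error case. This is a minimal illustration, not Friedman's reference implementation: the stump base learner, function names, and hyperparameter values are all assumptions made for the example.

```python
import random

def fit_stump(X, r):
    """Fit a one-split regression tree (stump) to targets r by exhaustively
    searching feature/threshold pairs for the split minimizing squared error."""
    n, best = len(X), None
    for j in range(len(X[0])):
        for t in sorted(set(x[j] for x in X)):
            left = [r[i] for i in range(n) if X[i][j] <= t]
            right = [r[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            err = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or err < best[0]:
                best = (err, j, t, ml, mr)
    _, j, t, ml, mr = best
    return lambda x, j=j, t=t, ml=ml, mr=mr: ml if x[j] <= t else mr

def sgb_fit(X, y, n_stages=100, lr=0.1, subsample=0.5, seed=0):
    """Stage-wise boosting with squared-error loss: the negative gradient
    (here simply the residuals) is computed for all samples, but each
    stump is fit to a random subsample of it, as in SGB."""
    rng, n = random.Random(seed), len(X)
    f0 = sum(y) / n                  # initial constant model (mean of y)
    pred, stumps = [f0] * n, []
    for _ in range(n_stages):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # -gradient of (y-f)^2/2
        idx = rng.sample(range(n), max(2, int(subsample * n)))
        h = fit_stump([X[i] for i in idx], [resid[i] for i in idx])
        pred = [pi + lr * h(x) for pi, x in zip(pred, X)]
        stumps.append(h)
    return f0, stumps

def sgb_predict(f0, stumps, x, lr=0.1):
    # lr must match the value used in sgb_fit
    return f0 + lr * sum(h(x) for h in stumps)
```

For example, boosting stumps on a small data set with a linear response drives the training error well below the variance of the response, even though each individual stump is a very weak learner; classification with binomial log-likelihood follows the same template with a different gradient.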