On the overestimation of random forest's out-of-bag error

Open Access

6 August 2018

journal article
research article
Published by Public Library of Science (PLoS) in PLOS ONE

Vol. 13 (8), e0201904
https://doi.org/10.1371/journal.pone.0201904

Abstract

The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error, depending on the choice of random forest parameters. Based on simulated and real data, this paper aims to identify settings for which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used in classification tasks for selecting tuning parameters like mtry, because the overestimation is seen to depend on the parameter mtry. The simulation-based and real-data-based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors and weak effects. There was hardly any impact of the overestimation on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when using the out-of-bag error for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling with sampling fractions that are proportional to the class sizes for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error. In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
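
The two sampling schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not fit a random forest; it only shows how the in-bag and out-of-bag index sets for a single tree might be drawn. The function names, the 70/30 class split and the sampling fraction of 0.632 (roughly the share of distinct observations in a bootstrap sample) are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def group_by_class(labels):
    """Map each class label to the indices of the observations carrying it."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return groups


def split_in_bag_out_of_bag(n_obs, in_bag):
    """Return sorted in-bag indices and the remaining (out-of-bag) indices."""
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(n_obs) if i not in in_bag_set]
    return sorted(in_bag), out_of_bag


def stratified_subsample(labels, fraction, rng):
    """Draw the same fraction of every class, without replacement.

    Each class contributes round(fraction * class_size) observations, so the
    class proportions of the in-bag set stay close to those of the full data.
    """
    in_bag = []
    for indices in group_by_class(labels).values():
        k = round(fraction * len(indices))
        in_bag.extend(rng.sample(indices, k))
    return split_in_bag_out_of_bag(len(labels), in_bag)


def balanced_subsample(labels, n_per_class, rng):
    """Draw the same number of observations from every class, without replacement."""
    in_bag = []
    for indices in group_by_class(labels).values():
        in_bag.extend(rng.sample(indices, n_per_class))
    return split_in_bag_out_of_bag(len(labels), in_bag)


# Hypothetical unbalanced data set: 70 observations of class 0, 30 of class 1.
labels = [0] * 70 + [1] * 30
rng = random.Random(0)  # fixed seed so the draw is reproducible

# Proportional scheme: 63.2% of each class goes in-bag (assumed fraction).
prop_in, prop_out = stratified_subsample(labels, 0.632, rng)

# Balanced scheme: 20 observations per class go in-bag (assumed size).
bal_in, bal_out = balanced_subsample(labels, 20, rng)
```

In a full analysis, each tree would be grown on its own in-bag set, and the out-of-bag error would be obtained by predicting every observation only with the trees for which it was out-of-bag.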

Funding Information

Deutsche Forschungsgemeinschaft (BO3139/6-1)
Deutsche Forschungsgemeinschaft (BO3139/2-2)

Cited by 154 articles