Robustness of random forests for regression

Abstract
In this paper, we empirically investigate the robustness of random forests for regression problems. We also investigate the performance of six variations of the original random forest method, all aimed at improving robustness. These variations are based on three main ideas: (1) robustifying the aggregation method, (2) robustifying the splitting criterion, and (3) taking a robust transformation of the response. More precisely, with the first idea, we use the median (or weighted median), instead of the mean, to combine the predictions from the individual trees. With the second idea, we use least absolute deviations from the median, instead of least squares, as the splitting criterion. With the third idea, we build the trees using the ranks of the response instead of its original values. The competing methods are compared via a simulation study with artificial data, using two different types of contamination, and with 13 real data sets. Our results show that all three ideas improve the robustness of the original random forest algorithm. However, a robust aggregation of the individual trees is generally more beneficial than a robust splitting criterion.
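As a minimal illustration of the first idea (robust aggregation), the sketch below trains a standard random forest with scikit-learn and then replaces the usual mean aggregation of the individual tree predictions by their median. This is an assumed implementation for illustration only, not the authors' code; the data set and contamination scheme are made up.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with a few large outliers in the response,
# mimicking (loosely) a contaminated setting.
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
rng = np.random.default_rng(0)
outlier_idx = rng.choice(len(y), size=10, replace=False)
y[outlier_idx] += 100.0  # contaminate 5% of the responses

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Standard aggregation: forest.predict returns the mean over the trees.
mean_pred = forest.predict(X)

# Robust alternative: take the median over the per-tree predictions.
tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])
median_pred = np.median(tree_preds, axis=0)
```

A weighted median (the paper's other aggregation variant) would replace `np.median` with a weighted quantile of `tree_preds`; the rank-transformation idea would instead fit the forest on `scipy.stats.rankdata(y)`.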