Bias in random forest variable importance measures: Illustrations, sources and a solution

Open Access

25 January 2007

journal article
research article
Published by Springer Science and Business Media LLC in BMC Bioinformatics

Vol. 8 (1), 25
https://doi.org/10.1186/1471-2105-8-25

Abstract

Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

Results: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. Two mechanisms underlie this deficiency: biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other.

Conclusion: We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.
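
The first mechanism named in the abstract, biased variable selection in individual classification trees, can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the simulation design of the paper: it is written in Python rather than R, it assumes a Gini-gain splitting criterion with exhaustive search over binary splits, and the predictor names, sample size, and number of replications are hypothetical choices made for illustration. The response is generated independently of all predictors, so an unbiased criterion would select each of the four predictors about equally often. The sketch does not cover the bootstrap effect or the proposed alternative implementation.

```python
import random
from itertools import combinations


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def split_gain(y, left_mask):
    """Gini gain obtained by sending masked observations to the left node."""
    left = [yi for yi, m in zip(y, left_mask) if m]
    right = [yi for yi, m in zip(y, left_mask) if not m]
    if not left or not right:
        return 0.0
    n = len(y)
    return gini(y) - len(left) / n * gini(left) - len(right) / n * gini(right)


def best_gain_categorical(x, y):
    """Best gain over all binary partitions of the observed categories."""
    cats = sorted(set(x))
    best = 0.0
    if len(cats) < 2:
        return best
    # Keep the first category on the left so mirrored partitions are not
    # evaluated twice; r < len(rest) keeps the right node non-empty.
    first, rest = cats[0], cats[1:]
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left_set = {first, *extra}
            best = max(best, split_gain(y, [xi in left_set for xi in x]))
    return best


def best_gain_continuous(x, y):
    """Best gain over all cutpoints between adjacent distinct values."""
    values = sorted(set(x))
    best = 0.0
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2.0
        best = max(best, split_gain(y, [xi <= cut for xi in x]))
    return best


def simulate_selection(n_obs=60, n_rep=200, seed=1):
    """Fraction of replications in which each uninformative predictor wins.

    All sizes and predictor types here are illustrative assumptions.
    """
    rng = random.Random(seed)
    names = ["binary", "4 categories", "8 categories", "continuous"]
    wins = dict.fromkeys(names, 0)
    for _ in range(n_rep):
        # The response is pure noise: no predictor carries any signal.
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        gains = {
            "binary": best_gain_categorical(
                [rng.randrange(2) for _ in range(n_obs)], y),
            "4 categories": best_gain_categorical(
                [rng.randrange(4) for _ in range(n_obs)], y),
            "8 categories": best_gain_categorical(
                [rng.randrange(8) for _ in range(n_obs)], y),
            "continuous": best_gain_continuous(
                [rng.random() for _ in range(n_obs)], y),
        }
        wins[max(gains, key=gains.get)] += 1
    return {name: wins[name] / n_rep for name in names}


if __name__ == "__main__":
    for name, freq in simulate_selection().items():
        print(f"{name:>13}: selected in {freq:.1%} of replications")
```

Because the binary predictor offers a single candidate split while the 8-category and continuous predictors offer many, the latter have far more chances to produce a large gain by chance alone, which is the kind of artificial preference the abstract describes.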

Cited by 2400 articles