The revival of the Gini importance?

Open Access

10 May 2018

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 34 (21), 3711-3718
https://doi.org/10.1093/bioinformatics/bty373

Abstract

Random forests are a fast, flexible and robust approach to analyzing high-dimensional data. A key advantage over alternative machine learning algorithms is their variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency. We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure that is unbiased with regard to the number of categories and minor allele frequency and is almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead beyond the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient. The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger. Supplementary data are available at Bioinformatics online.
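
The bias described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the debiasing procedure of the paper: it draws a binary outcome and a predictor that are independent of each other, then records the best Gini impurity decrease achievable by a single threshold split. The function names (`gini`, `best_gini_decrease`, `mean_null_decrease`) and the simulation settings are assumptions chosen for illustration. Because a predictor with more levels offers more candidate split points, its best decrease under the null tends to be larger, even though no predictor carries any signal.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(x, y):
    """Largest impurity decrease over all threshold splits of the form x <= t."""
    n = len(y)
    parent = gini(y)
    best = 0.0
    # Every observed value except the largest is a candidate threshold.
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def mean_null_decrease(n_levels, n_samples=200, n_reps=200, seed=0):
    """Average best decrease when the predictor is independent of the outcome.

    n_levels, n_samples, n_reps and seed are illustrative settings only.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        y = [rng.randint(0, 1) for _ in range(n_samples)]
        x = [rng.randrange(n_levels) for _ in range(n_samples)]
        total += best_gini_decrease(x, y)
    return total / n_reps


if __name__ == "__main__":
    # Compare uninformative predictors with few and many levels.
    for k in (2, 3, 10, 50):
        print(k, round(mean_null_decrease(k), 5))
```

In this sketch the predictor levels are treated as ordered, so a predictor with k levels has k - 1 candidate splits. An unbiased importance measure would give all of these uninformative predictors an importance near zero regardless of k.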

Funding Information

Deutsche Forschungsgemeinschaft (CRU303 Z2, FOR2488 P7, KO2250/5-1)

Cited by 415 articles