Variable importance in binary regression trees and forests

Open Access

31 December 2006

journal article
research article
Published by Institute of Mathematical Statistics in Electronic Journal of Statistics

Vol. 1, 519-537
https://doi.org/10.1214/07-EJS039

Abstract

We characterize and study variable importance (VIMP) and pairwise variable associations in binary regression trees. A key component involves the node mean squared error for a quantity we refer to as a maximal subtree. The theory naturally extends from single trees to ensembles of trees and applies to methods like random forests. This is useful because, although importance values from random forests are widely used to screen variables (for example, to filter high-throughput genomic data in bioinformatics), very little theory exists about their properties.


Cited by 303 articles