Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance

Abstract
Recursive partitioning methods producing tree-like models are a long-standing staple of predictive modeling. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree-building methods prevents them from treating different types of variables equally. This flaw manifests most clearly in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the era of big data. We propose a framework for splitting that uses leave-one-out (LOO) cross-validation (CV) to select the splitting variable and then performs a regular split (in our case, following CART's approach) on the selected variable. The most important consequence of our approach is that categorical variables with many categories can be safely used in tree building and are chosen only if they contribute to predictive power. We demonstrate through extensive simulations and real-data analyses that our splitting approach significantly improves the performance of both single-tree models and tree-based ensemble methods. Importantly, we design an algorithm for LOO splitting-variable selection which, under reasonable assumptions, does not substantially increase the overall computational complexity relative to CART for two-class classification.
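To make the proposed splitting scheme concrete, the sketch below is an illustrative (and deliberately naive) rendering of the idea, not the paper's efficient algorithm: at a single node, each candidate variable's best split is scored by leave-one-out CV, the variable with the lowest LOO error is selected, and CART's split is then refit on the full data for that variable. It assumes numeric predictors and 0/1 labels; for two-class problems a categorical variable can be reduced to this case by ordering its categories by the proportion of positive responses. All function names are hypothetical.

  import numpy as np

  def best_split_error(x, y, x_out=None, y_out=None):
      # CART-style best binary threshold split of numeric x for 0/1 labels y,
      # scored by (a monotone transform of) weighted Gini impurity.
      # If a held-out point (x_out, y_out) is given, return its 0/1
      # misclassification under that split; otherwise return the threshold.
      if len(y) < 2:
          return None
      order = np.argsort(x)
      xs, ys = x[order], y[order]
      n = len(ys)
      left_pos = np.cumsum(ys)[:-1]            # positives left of each cut
      left_n = np.arange(1, n)
      right_pos = ys.sum() - left_pos
      right_n = n - left_n
      pl, pr = left_pos / left_n, right_pos / right_n
      impurity = left_n * pl * (1 - pl) + right_n * pr * (1 - pr)
      valid = xs[:-1] < xs[1:]                 # cut only between distinct values
      if not valid.any():                      # constant x: no split possible
          return None
      k = np.flatnonzero(valid)[np.argmin(impurity[valid])]
      thr = (xs[k] + xs[k + 1]) / 2.0
      left_lab, right_lab = int(pl[k] >= 0.5), int(pr[k] >= 0.5)
      if x_out is None:
          return thr
      pred = left_lab if x_out < thr else right_lab
      return int(pred != y_out)

  def loo_select_variable(X, y):
      # Step 1: for each candidate column, estimate the out-of-sample error
      # of its best split by leave-one-out CV (naive O(n^2 p log n) loop;
      # the paper develops a much faster scheme).
      # Step 2: refit CART's split on the full data for the winning column.
      n, p = X.shape
      loo_err = np.full(p, np.inf)
      for j in range(p):
          errs = [best_split_error(np.delete(X[:, j], i), np.delete(y, i),
                                   x_out=X[i, j], y_out=y[i])
                  for i in range(n)]
          errs = [e for e in errs if e is not None]
          if errs:
              loo_err[j] = np.mean(errs)
      j_star = int(np.argmin(loo_err))
      return j_star, best_split_error(X[:, j_star], y)

On data containing, say, one informative numeric predictor alongside a many-category noise factor (numerically encoded), the LOO criterion discounts the noise factor's overfitted in-sample splits rather than rewarding them, which is the behavior the abstract claims for the full method.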
Funding Information
  • Israel Science Foundation (1487/12)
  • Israeli Ministry of Immigration (to Amichai Painsky)