Unsupervised Learning With Random Forest Predictors

1 March 2006

journal article
Published by Taylor & Francis Ltd in Journal of Computational and Graphical Statistics

Vol. 15 (1), 118-138
https://doi.org/10.1198/106186006x94072

Abstract

A random forest (RF) predictor is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of variables due to its intrinsic variable selection; for example, the Addcl1 RF dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
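
The construction described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation. It assumes numeric inputs only, uses scikit-learn's `RandomForestClassifier`, takes the Addcl1 reference distribution to mean sampling each variable independently from its empirical marginal, and converts the forest proximity into a dissimilarity as sqrt(1 - proximity). The function names `addcl1_synthetic` and `rf_dissimilarity` and the parameters `n_trees` and `seed` are hypothetical names chosen for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def addcl1_synthetic(X, rng):
    """Draw synthetic rows by sampling every column independently.

    Assumption: this is the Addcl1-style reference distribution, i.e. the
    product of the empirical marginals, which destroys dependence between
    variables while preserving each variable's own distribution.
    """
    n, p = X.shape
    cols = [rng.choice(X[:, j], size=n, replace=True) for j in range(p)]
    return np.column_stack(cols)


def rf_dissimilarity(X, n_trees=200, seed=0):
    """Return an (n, n) RF dissimilarity matrix for unlabeled data X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Label observed rows 1 and synthetic rows 0, then fit a classifier
    # that tries to tell the two apart.
    X_syn = addcl1_synthetic(X, rng)
    data = np.vstack([X, X_syn])
    labels = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(data, labels)

    # Proximity of two observed rows = fraction of trees in which they
    # end up in the same terminal node (leaf).
    leaves = forest.apply(X)  # shape (n, n_trees), one leaf index per tree
    prox = np.zeros((n, n))
    for t in range(n_trees):
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    prox /= n_trees

    # Assumption: dissimilarity defined as sqrt(1 - proximity).
    return np.sqrt(np.clip(1.0 - prox, 0.0, None))
```

The resulting matrix could then be passed to any clustering or scaling method that accepts a precomputed dissimilarity matrix. Because the trees split on thresholds, applying a monotonic transformation to a column leaves the splits, and hence the dissimilarity, essentially unchanged, which is the invariance property the abstract mentions.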


Cited by 365 articles