Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling
- 1 November 2003
- journal article
- research article
- Published by American Chemical Society (ACS) in Journal of Chemical Information and Computer Sciences
- Vol. 43 (6), 1947-1958
- https://doi.org/10.1021/ci034160g
Abstract
A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool capable of delivering performance that is among the most accurate methods to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
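The procedure summarized in the abstract (bootstrap samples, random feature selection at each node, unpruned trees, majority vote, and a built-in out-of-bag performance estimate) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the parameter name `mtry`, the Gini splitting criterion, and the two-descriptor toy data are all assumptions made for illustration, and only the classification case is shown.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, features):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, float("inf")
    for f in features:
        # Dropping the largest value keeps both children non-empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(X, y, mtry, rng):
    """Grow an unpruned tree; each node considers mtry randomly chosen features."""
    if len(set(y)) == 1:
        return y[0]  # pure leaf
    split = best_split(X, y, rng.sample(range(len(X[0])), mtry))
    if split is None:  # the sampled features are constant in this node
        return Counter(y).most_common(1)[0][0]
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], mtry, rng),
            build_tree([X[i] for i in right], [y[i] for i in right], mtry, rng))

def predict_tree(node, x):
    """Follow the splits down to a leaf and return its class label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node

def fit_forest(X, y, n_trees=50, mtry=1, seed=0):
    """Fit each tree on a bootstrap sample; remember which rows were in-bag."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        tree = build_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng)
        forest.append((tree, set(idx)))
    return forest

def predict_forest(forest, x):
    """Aggregate the ensemble by majority vote."""
    votes = Counter(predict_tree(tree, x) for tree, _ in forest)
    return votes.most_common(1)[0][0]

def oob_error(forest, X, y):
    """Out-of-bag error: each row is judged only by trees that never saw it."""
    wrong = counted = 0
    for i, (x, label) in enumerate(zip(X, y)):
        votes = Counter(predict_tree(tree, x)
                        for tree, inbag in forest if i not in inbag)
        if votes:
            counted += 1
            wrong += votes.most_common(1)[0][0] != label
    return wrong / counted if counted else float("nan")

# Hypothetical toy data: two made-up descriptors per compound.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.4, 0.25],
     [1.1, 1.3], [1.2, 1.0], [1.3, 1.4], [1.4, 1.2]]
y = ["inactive"] * 4 + ["active"] * 4

forest = fit_forest(X, y, n_trees=50, mtry=1, seed=0)
print(predict_forest(forest, [0.0, 0.0]), predict_forest(forest, [2.0, 2.0]))
print("OOB error:", oob_error(forest, X, y))
```

For regression, the majority vote would be replaced by an average of the tree predictions, as the abstract notes. The descriptor importance and compound similarity measures mentioned in the abstract are not covered by this sketch.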