A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

Open Access

22 July 2008

journal article
research article
Published by Springer Science and Business Media LLC in BMC Bioinformatics

Vol. 9 (1), 319
https://doi.org/10.1186/1471-2105-9-319

Abstract

Background: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology, with several molecular signatures on their way toward clinical deployment. Using the most accurate classification algorithms available for microarray gene expression data is critical to developing the best possible molecular signatures for patient care. As a large body of literature suggests, support vector machines can be considered "best of class" algorithms for classifying such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain.

Results: In the present paper we identify methodological biases in prior work comparing random forests and support vector machines, and we conduct a new, rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underline the importance of sound research design in benchmarking and comparing bioinformatics algorithms.

Conclusion: We found that, both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines, both when no gene selection is performed and when several popular gene selection methods are used.
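The conclusion rests on two summaries of paired per-dataset results: the average difference between the two algorithms, and the number of datasets on which each one does better. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The function name `summarize_comparison` and the score values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from statistics import mean


def summarize_comparison(scores_a, scores_b):
    """Summarize paired per-dataset scores for two algorithms (higher is better)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    # One difference per dataset: positive means algorithm A scored higher.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins_a = sum(d > 0 for d in diffs)
    wins_b = sum(d < 0 for d in diffs)
    return {
        "mean_difference": mean(diffs),
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": len(diffs) - wins_a - wins_b,
        "a_better_on_average": mean(diffs) > 0,
        "a_better_on_majority": wins_a > len(diffs) / 2,
    }


# Placeholder scores for five imaginary datasets; NOT values from the paper.
svm_scores = [0.90, 0.85, 0.75, 0.95, 0.70]
rf_scores = [0.80, 0.85, 0.80, 0.75, 0.60]

summary = summarize_comparison(svm_scores, rf_scores)
print(summary)
```

Reporting both summaries guards against a single dataset with a very large difference dominating the average. In a real benchmark, each score would come from a cross-validated estimate on one dataset.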

Cited by 512 articles