Use of Wrapper Algorithms Coupled with a Random Forests Classifier for Variable Selection in Large-Scale Genomic Association Studies

1 December 2009

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 16 (12), 1705-1718
https://doi.org/10.1089/cmb.2008.0037

Abstract

Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Some variable selection procedure should therefore be performed before traditional data analysis methods are applied, both for computational efficiency and to limit overfitting. We describe here a “wrapper” strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case–cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes fewer than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for the CHD (fewer than 150 SNPs) and LDL (fewer than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.
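
As a rough illustration of the general wrapper idea mentioned in the abstract (and not of the SIZEFIT or FIXFIT procedures themselves, whose details are not given on this page), the sketch below runs a simple stochastic local search that repeatedly tries to drop one variable and keeps the smaller subset whenever a scoring function does not get worse. The names `wrapper_select` and `toy_score`, the set of "informative" indices, and all parameter values are hypothetical. In a real study the scorer would presumably be something like the cross-validated or out-of-bag accuracy of a Random Forests classifier trained on the candidate SNP subset; a toy scorer stands in for it here so that the example needs only the standard library.

```python
import random
from typing import Callable, FrozenSet, Iterable, Tuple

# A scorer maps a candidate feature subset to a quality value (higher is better).
Scorer = Callable[[FrozenSet[int]], float]


def wrapper_select(
    features: Iterable[int],
    score: Scorer,
    n_iter: int = 2000,
    seed: int = 0,
) -> Tuple[FrozenSet[int], float]:
    """Stochastic backward elimination driven by an external scorer.

    At each step one randomly chosen feature is tentatively removed.
    The removal is kept only if the score does not decrease.
    """
    rng = random.Random(seed)
    current = frozenset(features)
    current_score = score(current)
    for _ in range(n_iter):
        if len(current) <= 1:
            break
        dropped = rng.choice(sorted(current))
        candidate = current - {dropped}
        candidate_score = score(candidate)
        if candidate_score >= current_score:
            current, current_score = candidate, candidate_score
    return current, current_score


# Hypothetical toy scorer: only three of fifty features carry any signal.
# A classifier-based accuracy estimate would replace this in practice.
INFORMATIVE = frozenset({3, 17, 42})


def toy_score(subset: FrozenSet[int]) -> float:
    return len(subset & INFORMATIVE) / len(INFORMATIVE)


selected, final_score = wrapper_select(range(50), toy_score)
print(sorted(selected), final_score)
```

Because the classifier is hidden behind the `score` callable, the search strategy can be swapped (for example, for an Extremal Optimization style update) without touching the evaluation code, which is the main appeal of the wrapper design.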

Cited by 22 articles