Brief review of regression‐based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience

1 January 2011

journal article
review article
Published by Wiley in Genetic Epidemiology

Vol. 35 (S1), S5-S11
https://doi.org/10.1002/gepi.20642

Abstract

Genetic Analysis Workshop 17 provided common and rare genetic variants from exome sequencing data and simulated binary and quantitative traits in 200 replicates. We provide a brief review of the machine learning and regression‐based methods used in the analyses of these data. Several regression and machine learning methods were used to address different problems inherent in the analyses of these data, which are high‐dimension, low‐sample‐size data typical of many genetic association studies. Unsupervised methods, such as cluster analysis, were used for data segmentation and subset selection. Supervised learning methods, which include regression‐based methods (e.g., generalized linear models, logic regression, and regularized regression) and tree‐based methods (e.g., decision trees and random forests), were used for variable selection (selecting the genetic and clinical features most associated with or predictive of outcome) and prediction (developing models using common and rare genetic variants to accurately predict outcome), with the outcome being case‐control status or quantitative trait value. We include a discussion of cross‐validation for model selection and assessment, and a description of available software resources for these methods. Genet. Epidemiol. 35:S5–S11, 2011.
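
The abstract mentions cross‐validation for model selection alongside regularized regression. As a rough illustration only, and not the procedure used by any Workshop contributor, the sketch below uses k‐fold cross‐validation to choose a ridge penalty on toy data with more predictors than samples. All function names, the simulated genotype coding (0/1/2), the effect sizes, and the penalty grid are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_ridge(X, y, penalty, n_sweeps=50):
    """Ridge regression (no intercept) fitted by coordinate descent.

    Minimizes sum((y - X beta)^2) + penalty * sum(beta^2).
    Assumes penalty > 0 so every denominator is positive.
    """
    n_features = len(X[0])
    beta = [0.0] * n_features
    residual = list(y)  # equals y - X beta while beta is all zeros
    for _ in range(n_sweeps):
        for j in range(n_features):
            column = [row[j] for row in X]
            # Residual with feature j's current contribution added back.
            partial = [r + c * beta[j] for r, c in zip(residual, column)]
            numerator = sum(c * p for c, p in zip(column, partial))
            denominator = sum(c * c for c in column) + penalty
            beta[j] = numerator / denominator
            residual = [p - c * beta[j] for p, c in zip(partial, column)]
    return beta


def mean_squared_error(X, y, beta):
    """Average squared prediction error of a fitted coefficient vector."""
    predictions = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(predictions, y)) / len(y)


def cross_validate_penalty(X, y, penalties, k=5, seed=0):
    """Return the penalty with the lowest mean held-out error, plus all errors."""
    folds = k_fold_indices(len(y), k, seed)
    cv_error = {}
    for penalty in penalties:
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            beta = fit_ridge([X[i] for i in train], [y[i] for i in train], penalty)
            fold_errors.append(
                mean_squared_error(
                    [X[i] for i in held_out], [y[i] for i in held_out], beta
                )
            )
        cv_error[penalty] = sum(fold_errors) / k
    best = min(cv_error, key=cv_error.get)
    return best, cv_error


def simulate_toy_data(n_samples=20, n_features=30, seed=1):
    """Hypothetical data: 0/1/2 genotype codes; only two variants affect the trait."""
    rng = random.Random(seed)
    X = [
        [float(rng.choice([0, 1, 2])) for _ in range(n_features)]
        for _ in range(n_samples)
    ]
    y = [1.5 * row[0] - 1.0 * row[1] + rng.gauss(0.0, 0.5) for row in X]
    return X, y
```

Calling `cross_validate_penalty(*simulate_toy_data(), penalties=[0.1, 1.0, 10.0])` returns the selected penalty and the per‐penalty held‐out error. Because the same folds are used both to select the penalty and to report its error, that error is optimistic as an assessment of the final model; a separate test set or nested cross‐validation would be needed for an honest estimate.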

