EM-random forest and new measures of variable importance for multi-locus quantitative trait linkage analysis
Open Access
- 21 May 2008
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 24 (14), 1603-1610
- https://doi.org/10.1093/bioinformatics/btn239
Abstract
Motivation: We developed an EM-random forest (EMRF) for Haseman–Elston quantitative trait linkage analysis that accounts for marker ambiguity and weights each sib-pair according to the posterior identical by descent (IBD) distribution. The usual random forest (RF) variable importance (VI) index used to rank markers for variable selection is not optimal when applied to linkage data because of correlation between markers. We define new VI indices that borrow information from linked markers using the correlation structure inherent in IBD linkage data.
Results: Using simulations, we find that the new VI indices in EMRF performed better than the original RF VI index and performed similarly to or better than the EM-Haseman–Elston regression LOD score for various genetic models. Moreover, tree size and the size of the marker subset evaluated at each node are important considerations in RFs.
Availability: The source code for EMRF written in C is available at www.infornomics.utoronto.ca/downloads/EMRF
Contact: bull@mshri.on.ca
Supplementary information: Supplementary data are available at www.infornomics.utoronto.ca/downloads/EMRF
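For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of classical single-marker Haseman–Elston regression: the squared trait difference of each sib-pair is regressed on the expected proportion of alleles shared IBD, and a negative slope suggests linkage. This is not the EMRF algorithm and is not taken from the authors' C source. The function names, the toy data, and the use of ordinary least squares on the expected IBD proportion (rather than EM weighting over the full posterior IBD distribution) are all assumptions made for illustration.

```python
def expected_ibd_proportion(p0, p1, p2):
    """Expected proportion of alleles shared IBD, given posterior
    probabilities of a sib-pair sharing 0, 1, or 2 alleles IBD."""
    return 0.5 * p1 + 1.0 * p2


def haseman_elston_fit(trait_pairs, ibd_posteriors):
    """Ordinary least-squares fit of squared sib-pair trait differences
    on the expected IBD proportion at one marker (hypothetical helper).

    trait_pairs:     list of (trait_sib1, trait_sib2)
    ibd_posteriors:  list of (p0, p1, p2), one tuple per sib-pair
    Returns (slope, intercept).
    """
    y = [(a - b) ** 2 for a, b in trait_pairs]
    x = [expected_ibd_proportion(*p) for p in ibd_posteriors]
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (invented for illustration): pairs sharing more alleles IBD
# have more similar trait values, so the fitted slope is negative.
toy_traits = [(1.0, 3.0), (2.0, 2.5), (0.0, 0.2), (1.5, 3.5)]
toy_posteriors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                  (0.0, 0.0, 1.0), (0.5, 0.5, 0.0)]
toy_slope, toy_intercept = haseman_elston_fit(toy_traits, toy_posteriors)
```

Collapsing the posterior to a single expected value discards the marker ambiguity that, according to the abstract, EMRF retains by weighting each sib-pair over its posterior IBD distribution; the sketch only shows the regression that the method starts from.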