Pursuing sources of heterogeneity in modeling clustered population
- 2 February 2021
- journal article
- research article
- Published by Oxford University Press (OUP) in Biometrics
- Vol. 78 (2), 716-729
- https://doi.org/10.1111/biom.13434
Abstract
Researchers often have to deal with heterogeneous populations with mixed regression relationships, increasingly so in the era of data explosion. In such problems, when there are many candidate predictors, it is of interest not only to identify the predictors that are associated with the outcome, but also to distinguish the true sources of heterogeneity, that is, to identify the predictors that have different effects among the clusters and thus are the true contributors to the formation of the clusters. We clarify the concepts of the source of heterogeneity that account for potential scale differences of the clusters and propose a regularized finite mixture effects regression to achieve heterogeneity pursuit and feature selection simultaneously. We develop an efficient algorithm and show that our approach can achieve both estimation and selection consistency. Simulation studies further demonstrate the effectiveness of our method under various practical scenarios. Three applications are presented, namely, an imaging genetics study for linking genetic factors and brain neuroimaging traits in Alzheimer's disease, a public health study for exploring the association between suicide risk among adolescents and their school district characteristics, and a sport analytics study for understanding how the salary levels of baseball players are associated with their performance and contractual status.
Funding Information
- National Natural Science Foundation of China (11661038)
- U.S. Department of Energy (10006272)