Robust discovery of local patterns

28 January 2012

conference paper
Published by Association for Computing Machinery (ACM)

pp. 265-274
https://doi.org/10.1145/2110363.2110395

Abstract

The identification of unanticipated statistical associations is a core activity in exploratory analysis of high-dimensional biomedical data. Specifically, post-marketing surveillance for harmful effects of medicines relies on effective algorithms to detect associations between drugs and suspected adverse drug reactions. The WHO global individual case safety reports database, VigiBase, holds over six million reports and covers more than ten thousand medicinal products and thousands of distinct medical concepts. It collects data from more than 100 countries across the world, and its first reports date back to the late 1960s. Some local patterns may not show in database-wide analyses, and many associations vary substantially in strength or direction across data subsets. Still, routine screening of this and similar databases relies on global measures of association. In this paper, we propose a framework to detect local associations and characterise subset variability in high-dimensional data. We use shrinkage observed-to-expected ratios and employ multiple stratification by one or two covariates at a time. We consider subset-specific measures, stratified-then-pooled adjusted measures, and a novel measure to detect associations that hold in all-but-one subset. We use covariate permutation to select stratification covariates and gauge the vulnerability to spurious associations. Chance findings are a major concern: a naive subgroup analysis yielded more than 50% spurious local associations in VigiBase. To improve on this, we enforce conservative credibility intervals and also look for subset-specific associations that reproduce in at least one additional subset (e.g. two time periods). In addition to 119,500 global associations between drugs and medical events in VigiBase, such robust subgroup analysis uncovered 14,600 local associations, of which an estimated 2.2% are spurious.
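
To make the subset-specific and stratified-then-pooled measures more concrete, here is a minimal Python sketch on toy data. It is not the paper's implementation. The abstract does not state the formula, so the shrinkage form log2((O + 0.5) / (E + 0.5)), the constant 0.5, the independence-based expected count, and all function and variable names are assumptions made for illustration. Credibility intervals and covariate permutation are left out.

```python
from collections import Counter
from math import log2


def shrunk_log_oe(observed, expected, k=0.5):
    """Shrunk log2 observed-to-expected ratio.

    Assumed form: log2((O + k) / (E + k)); k = 0.5 is an assumption,
    not a value taken from the abstract.
    """
    return log2((observed + k) / (expected + k))


def expected_count(n_drug, n_event, n_total):
    """Expected co-reports if drug and event were independent."""
    return n_drug * n_event / n_total if n_total else 0.0


def local_and_pooled(reports, drug, event):
    """Return (per-stratum scores, stratified-then-pooled score, crude score).

    `reports` is an iterable of (drugs, events, stratum) tuples, where
    `drugs` and `events` are sets. This layout is hypothetical.
    """
    n, n_d, n_e, n_de = Counter(), Counter(), Counter(), Counter()
    for drugs, events, stratum in reports:
        has_d, has_e = drug in drugs, event in events
        n[stratum] += 1
        n_d[stratum] += has_d
        n_e[stratum] += has_e
        n_de[stratum] += has_d and has_e

    # Expected counts are computed within each stratum.
    exp = {s: expected_count(n_d[s], n_e[s], n[s]) for s in n}

    # Subset-specific measure: one score per stratum.
    per_stratum = {s: shrunk_log_oe(n_de[s], exp[s]) for s in n}

    # Stratified-then-pooled: sum observed and expected over strata first.
    total_o = sum(n_de.values())
    pooled = shrunk_log_oe(total_o, sum(exp.values()))

    # Crude (global) measure: ignore the stratification entirely.
    crude_e = expected_count(sum(n_d.values()), sum(n_e.values()), sum(n.values()))
    crude = shrunk_log_oe(total_o, crude_e)
    return per_stratum, pooled, crude


# Toy data: drug X and event Y co-occur only in the "young" stratum.
reports = (
    [({"X"}, {"Y"}, "young")] * 5
    + [({"Z"}, {"W"}, "young")] * 15
    + [({"X"}, {"W"}, "old")] * 5
    + [({"Z"}, {"Y"}, "old")] * 5
    + [({"Z"}, {"W"}, "old")] * 10
)

per_stratum, pooled, crude = local_and_pooled(reports, "X", "Y")
for stratum, score in sorted(per_stratum.items()):
    print(f"{stratum}: {score:+.2f}")
print(f"pooled: {pooled:+.2f}, crude: {crude:+.2f}")
```

In this toy example the score is clearly positive in one stratum and negative in the other, while the pooled and crude scores sit in between. This is the kind of local pattern that a purely global measure can understate.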

Cited by 20 articles