A method for generating synthetic longitudinal health data
Open Access
- 23 March 2023
- journal article
- research article
- Published by Springer Science and Business Media LLC in BMC Medical Research Methodology
- Vol. 23 (1), 1-21
- https://doi.org/10.1186/s12874-023-01869-w
Abstract
Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health’s administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that examine the structure and general patterns in the data, and by recreating a specific analysis commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset. Generic utility assessments used Hellinger distance to quantify the difference in distributions between real and synthetic datasets for event types (0.027) and attributes (mean 0.0417), and compared Markov transition matrices (order 1: mean absolute difference 0.0896, sd 0.159; order 2: mean Hellinger distance 0.2195, sd 0.2724). The Hellinger distance between the joint distributions was 0.352, and random cohorts generated from real and synthetic data had a mean Hellinger distance of 0.3 and a mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. When a realistic analysis was applied to both real and synthetic datasets, Cox regression achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating that synthetic data produces similar analytic results to real data.
The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.
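The two utility metrics named in the abstract, Hellinger distance and confidence interval overlap, can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the Hellinger distance is the standard formula for discrete distributions, and the overlap measure follows one commonly used definition (the average fraction of each interval covered by their intersection, clamped at zero for disjoint intervals), which is assumed here rather than confirmed by the abstract.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    p and q are equal-length sequences of probabilities over the same
    categories. The result is 0 for identical distributions and 1 for
    distributions with no shared support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def ci_overlap(real_ci, synth_ci):
    """Confidence interval overlap (assumed definition, see lead-in).

    Each argument is a (lower, upper) tuple. The overlap is the mean of
    the share of the real interval and the share of the synthetic
    interval covered by their intersection; disjoint intervals score 0.
    """
    lower = max(real_ci[0], synth_ci[0])
    upper = min(real_ci[1], synth_ci[1])
    intersection = max(0.0, upper - lower)
    real_width = real_ci[1] - real_ci[0]
    synth_width = synth_ci[1] - synth_ci[0]
    return 0.5 * (intersection / real_width + intersection / synth_width)


if __name__ == "__main__":
    # Made-up category frequencies, for illustration only.
    real = [0.50, 0.30, 0.20]
    synthetic = [0.48, 0.31, 0.21]
    print(hellinger_distance(real, synthetic))

    # Made-up hazard ratio confidence intervals, for illustration only.
    print(ci_overlap((1.0, 2.0), (1.5, 2.5)))
```

Under this definition, a value near 0 for the Hellinger distance and a value near 1 (100%) for the interval overlap both indicate close agreement between real and synthetic data.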
Funding Information
- Replica Analytics Ltd.
- Bill and Melinda Gates Foundation
- Canadian Institutes of Health Research
- Natural Sciences and Engineering Research Council of Canada
- Canada Research Chairs
- Mitacs
- Alberta Innovates
- Health Cities, Edmonton, Canada
- Institute for Health Economics, Canada