Generating high-fidelity synthetic patient data for assessing machine learning healthcare software

Open Access

9 November 2020

journal article
research article
Published by Springer Science and Business Media LLC in npj Digital Medicine

Vol. 3 (1), 1-13
https://doi.org/10.1038/s41746-020-00353-9

Abstract

There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, there are many issues concerning patient privacy that need to be accounted for in order to enable this data to be better harnessed by all sectors. One approach that could offer a method of circumventing privacy issues is the creation of realistic synthetic data sets that capture as many of the complexities of the original data set as possible (distributions, non-linear relationships, and noise) but that do not actually include any real patient data. While previous research has explored models for generating synthetic data sets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis for producing realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables, and the resulting sensitivity analysis statistics from machine learning classifiers, while quantifying the risks of patient re-identification from synthetic datapoints. We show that, through our approach of integrating outlier analysis with graphical modelling and resampling, we can achieve synthetic data sets that are not significantly different from original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers. What is more, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low.

Funding Information

Innovate UK
Regulators’ Pioneer Fund, The Department for Business, Energy and Industrial Strategy (BEIS), administered by Innovate UK

Cited by 93 articles