Stability metrics for multi-source biomedical data based on simplicial projections from probability distribution distances
- 11 July 2016
- journal article
- research article
- Published by SAGE Publications in Statistical Methods in Medical Research
- Vol. 26 (1), 312-336
- https://doi.org/10.1177/0962280214545122
Abstract
Biomedical data may be composed of individuals generated from distinct, meaningful sources. Due to possible contextual biases in the processes that generate data, there may exist an undesirable and unexpected variability among the probability distribution functions (PDFs) of the source subsamples, which, when uncontrolled, may lead to inaccurate or unreproducible research results. Classical statistical methods may have difficulty uncovering such variability when dealing with multi-modal, multi-type, multivariate data. This work proposes two metrics for the analysis of stability among multiple data sources, robust to the aforementioned conditions, and defined in the context of data quality assessment. Specifically, a global probabilistic deviation metric and a source probabilistic outlyingness metric are proposed. The first provides a bounded degree of the global multi-source variability, designed as an estimator equivalent to the notion of normalized standard deviation of PDFs. The second provides a bounded degree of the dissimilarity of each source to a latent central distribution. The metrics are based on the projection of a simplex geometrical structure constructed from the Jensen–Shannon distances among the sources' PDFs. The metrics were evaluated, and showed correct behaviour, on a simulated benchmark and on real multi-source biomedical data using the UCI Heart Disease data set. Biomedical data quality assessment based on the proposed stability metrics may improve the efficiency and effectiveness of biomedical data exploitation and research.
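The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`js_distance`, `centroid_distances`, `stability_sketch`) are hypothetical, each source is assumed to be summarized by a discrete histogram over shared bins, and the normalization is an assumption (the circumradius of a regular simplex with unit edges, chosen because base-2 Jensen–Shannon distances never exceed 1). Instead of building explicit simplex coordinates, the sketch recovers each vertex's distance to the simplex centroid directly from the pairwise distances, which is valid when the distances admit a Euclidean embedding.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 divergence), in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))  # clamp tiny negative rounding errors


def centroid_distances(dist):
    """Distance of each simplex vertex to the centroid, from pairwise distances.

    Uses the identity ||x_i - c||^2 = (1/n) * sum_j d_ij^2
                                      - (1/(2 n^2)) * sum_{j,k} d_jk^2.
    """
    n = len(dist)
    total = sum(dist[j][k] ** 2 for j in range(n) for k in range(n))
    out = []
    for i in range(n):
        row = sum(dist[i][j] ** 2 for j in range(n))
        out.append(math.sqrt(max(row / n - total / (2 * n * n), 0.0)))
    return out


def stability_sketch(pdfs):
    """Return (global deviation, per-source centroid distances).

    ASSUMPTION: the global value is the root-mean-square centroid distance
    divided by the circumradius of a regular simplex with unit edges,
    sqrt((n - 1) / (2 n)); the paper's exact normalization may differ.
    """
    n = len(pdfs)
    if n < 2:
        raise ValueError("at least two sources are required")
    dist = [[js_distance(p, q) for q in pdfs] for p in pdfs]
    to_centroid = centroid_distances(dist)
    rms = math.sqrt(sum(d * d for d in to_centroid) / n)
    unit_radius = math.sqrt((n - 1) / (2 * n))
    return rms / unit_radius, to_centroid
```

Under these assumptions the global value is 0 when all source histograms coincide and reaches 1 when every pair of sources has disjoint support. The per-source distances are left unnormalized here, since the bound used for the outlyingness metric is not given in the abstract.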