Compressive Big Data Analytics: An Ensemble Meta-Algorithm for High-dimensional Multisource Datasets

Open Access

20 January 2020

preprint content
Published by Cold Spring Harbor Laboratory

https://doi.org/10.1101/2020.01.20.912485

Abstract

Health advances are contingent on the continuous development of new methods and approaches to foster data-driven discovery in the biomedical and clinical health sciences. Open science offers hope for tackling some of the challenges associated with Big Data and team-based scientific discovery. Domain-independent reproducibility, area-specific replicability, curation, analysis, organization, management, and sharing of health-related digital objects are critical components.

This study expands the functionality and utility of an ensemble semi-supervised machine learning technique called Compressive Big Data Analytics (CBDA). Applied to high-dimensional data, CBDA identifies salient features and key biomarkers for reliable and reproducible forecasting of binary or multinomial outcomes. The method relies on iterative subsampling, combines function optimization and statistical inference, and generates ensemble predictions of observed univariate outcomes. In this manuscript, we extend the CBDA technique by (1) efficiently handling extremely large datasets, (2) generalizing the internal and external validation steps, (3) expanding the set of base-learners for joint ensemble prediction, (4) introducing an automated selection of CBDA specifications, and (5) providing mechanisms to assess CBDA convergence, evaluate the prediction accuracy, and measure result consistency.

We validated the CBDA 2.0 technique using synthetic datasets as well as a population-wide census-like study, which grounds the mathematical models and the computational algorithm in translational health research settings. Specifically, we empirically validated the CBDA technique on a large-scale clinical study (UK Biobank), which includes imaging, cognitive, and clinical assessment data. The UK Biobank archive presents several difficult challenges related to the aggregation, harmonization, modeling, and interrogation of the information. These problems stem from the complex longitudinal structure, feature heterogeneity, multicollinearity, incongruency, and missingness of the data, as well as from violations of classical parametric assumptions, and they require novel health analytical approaches.

Our results showcase the scalability, efficiency, and potential of CBDA to compress complex data into structural information leading to derived knowledge and translational action. The results of the real case study suggest new and exciting avenues of research in the context of identifying, tracking, and treating mental health and aging-related disorders. Following open-science principles, we share the entire end-to-end protocol, source code, and results. This facilitates independent validation, result reproducibility, and team-based collaborative discovery.
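The abstract describes the method only at a high level: iterative subsampling, base-learner fitting, and ensemble-style feature identification. The following is a minimal illustrative sketch of that general idea, not the authors' CBDA implementation. Everything in it is an assumption made for illustration: the synthetic data, the toy nearest-centroid base-learner (CBDA uses an ensemble of base-learners), the function names, and all parameter values.

```python
import random
from collections import Counter

def make_data(n_rows=300, n_features=30, signal=(0, 1, 2), seed=0):
    """Hypothetical synthetic data: only the `signal` columns are informative."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_rows):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in signal:
            row[j] += 2.0 * label  # shift informative features for class 1
        X.append(row)
        y.append(label)
    return X, y

def centroid_fit(X, y, cols):
    """Toy base-learner (an assumption): per-class means of the chosen columns."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        denom = len(rows) or 1  # guard against an empty class in a subsample
        cents[label] = [sum(r[j] for r in rows) / denom for j in cols]
    return cents

def centroid_predict(cents, x, cols):
    """Predict the class whose centroid is nearest in squared distance."""
    def dist(label):
        return sum((x[j] - c) ** 2 for j, c in zip(cols, cents[label]))
    return min((0, 1), key=dist)

def subsample_feature_ranking(X, y, n_iter=600, row_frac=0.3, n_cols=5,
                              top_frac=0.1, seed=1):
    """Rank features by how often they occur in the best-scoring subsamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    results = []
    for _ in range(n_iter):
        rows = rng.sample(range(n), int(row_frac * n))  # random subset of cases
        cols = rng.sample(range(p), n_cols)             # random subset of features
        row_set = set(rows)
        held_out = [i for i in range(n) if i not in row_set]
        cents = centroid_fit([X[i] for i in rows], [y[i] for i in rows], cols)
        hits = sum(centroid_predict(cents, X[i], cols) == y[i] for i in held_out)
        results.append((hits / len(held_out), cols))
    # Keep the top-performing subsamples and count feature occurrences in them.
    results.sort(key=lambda r: r[0], reverse=True)
    top = results[: max(1, int(top_frac * n_iter))]
    counts = Counter(j for _, cols in top for j in cols)
    return [j for j, _ in counts.most_common()]

X, y = make_data()
ranking = subsample_feature_ranking(X, y)
print(ranking[:5])  # most frequently selected feature indices
```

Features that carry signal tend to appear more often among the best-scoring subsamples, so frequency in that top set serves as a simple importance score. Because each iteration touches only a small slice of rows and columns, the same loop can in principle be applied to datasets too large to model in one pass.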

