Compressive Big Data Analytics: An Ensemble Meta-Algorithm for High-dimensional Multisource Datasets

Abstract
Health advances are contingent on continuous development of new methods and approaches to foster data driven discovery in the biomedical and clinical health sciences. Open-science offers hope for tackling some of the challenges associated with Big Data and team-based scientific discovery. Domain-independent reproducibility, area-specific replicability, curation, analysis, organization, management and sharing of health-related digital objects are critical components.This study expands the functionality and utility of an ensemble semi-supervised machine learning technique called Compressive Big Data Analytics (CBDA). Applied to high-dimensional data, CBDA identifies salient features and key biomarkers for reliable and reproducible forecasting of binary or multinomial outcomes. The method relies on iterative subsampling, combines function optimization and statistical inference, and generates ensemble predictions of observed univariate outcomes. In this manuscript, we extend the CBDA technique by (1) efficiently handling extremely large datasets, (2) generalizing the internal and external validation steps, (3) expanding the set of base-learners for joint ensemble prediction, (4) introduce an automated selection of CBDA specifications, and (5) provide mechanisms to assess CBDA convergence, evaluate the prediction accuracy, and measure result consistency.We validated the CBDA 2.0 technique using synthetic datasets as well as a population-wide census-like study, which grounds the mathematical models and the computational algorithm into translational health research settings. Specifically, we empirically validated the CBDA technique on a large-scale clinical study (UK Biobank), which includes imaging, cognitive, and clinical assessment data. The UK Biobank archive presents several difficult challenges related to the aggregation, harmonization, modeling, and interrogation of the information. These problems are related to the complex longitudinal structure, feature heterogeneity, multicollinearity, incongruency, and missingness, as well as violations of classical parametric assumptions that require novel health analytical approaches.Our results showcase the scalability, efficiency and potential of CBDA to compress complex data into structural information leading to derived knowledge and translational action. The results of the real case-study suggest new and exciting avenues of research in the context of identifying, tracking, and treating mental health and aging-related disorders. Following open-science principles, we share the entire end-to-end protocol, source-code, and results. This facilitates independent validation, result reproducibility, and team-based collaborative discovery.

This publication has 29 references indexed in Scilit: