Hybrid parallelization strategies for large-scale machine learning in SystemML
- 1 March 2014
- journal article
- Published by Association for Computing Machinery (ACM) in Proceedings of the VLDB Endowment
- Vol. 7 (7), 553-564
- https://doi.org/10.14778/2732286.2732292
Abstract
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables, in contrast to existing large-scale machine learning libraries, automatic optimization. SystemML's primary focus is on data parallelism but many ML algorithms inherently exhibit opportunities for task parallelism as well. A major challenge is how to efficiently combine both types of parallelism for arbitrary ML scripts and workloads. In this paper, we present a systematic approach for combining task and data parallelism for large-scale machine learning on top of MapReduce. We employ a generic Parallel FOR construct (ParFOR) as known from high performance computing (HPC). Our core contributions are (1) complementary parallelization strategies for exploiting multi-core and cluster parallelism, as well as (2) a novel cost-based optimization framework for automatically creating optimal parallel execution plans. Experiments on a variety of use cases showed that this achieves both efficiency and scalability due to automatic adaptation to ad-hoc workloads and unknown data characteristics.