Load‐balanced and locality‐aware scheduling for data‐intensive workloads at extreme scales

14 August 2015

journal article
research article
Published by Wiley in Concurrency and Computation: Practice and Experience

Vol. 28 (1), 70-94
https://doi.org/10.1002/cpe.3617

Abstract

Data‐driven programming models such as many‐task computing (MTC) have been prevalent for running data‐intensive scientific applications. MTC applies over‐decomposition to enable distributed scheduling. To achieve extreme scalability, MTC proposes a fully distributed task scheduling architecture that employs as many schedulers as the compute nodes to make scheduling decisions. Achieving distributed load balancing and best exploiting data locality are two important goals for the best performance of distributed scheduling of data‐intensive applications. Our previous research proposed a data‐aware work‐stealing technique to optimize both load balancing and data locality by using both dedicated and shared task ready queues in each scheduler. Tasks were organized in queues based on the input data size and location. Distributed key‐value store was applied to manage task metadata. We implemented the technique in MATRIX, a distributed MTC task execution framework. In this work, we devise an analytical suboptimal upper bound of the proposed technique, compare MATRIX with other scheduling systems, and explore the scalability of the technique at extreme scales. Results show that the technique is not only scalable but can achieve performance within 15% of the suboptimal solution. Copyright © 2015 John Wiley & Sons, Ltd.

Keywords

Funding Information

U.S. Department of Energy (DE-FC02-06ER25750)
National Science Foundation (NSF) (OCI-1054974, CNS-1042543)

This publication has 44 references indexed in Scilit:

Impact of Over-Decomposition on Coordinated Checkpoint/Rollback Protocol
Lecture Notes in Computer Science, 2012
Software challenges in extreme scale systems
Journal of Physics: Conference Series, 2009
Analysis of size interval task assignment policies
ACM SIGMETRICS Performance Evaluation Review, 2008
Distributed computing in practice: the Condor experience
Concurrency and Computation: Practice and Experience, 2005
Job placement with unknown duration and no preemption
ACM SIGMETRICS Performance Evaluation Review, 2001
The implementation of the Cilk-5 multithreaded language
ACM SIGPLAN Notices, 1998
An approximate analysis of the join the shortest queue (JSQ) policy
IEEE Transactions on Parallel and Distributed Systems, 1996
Scalable Load Balancing Techniques for Parallel Computers
Journal of Parallel and Distributed Computing, 1994
Strategies for dynamic load balancing on highly parallel computers
IEEE Transactions on Parallel and Distributed Systems, 1993
NP-complete scheduling problems
Journal of Computer and System Sciences, 1975

Cited by 24 articles