Load‐balanced and locality‐aware scheduling for data‐intensive workloads at extreme scales
- 14 August 2015
- journal article
- research article
- Published by Wiley in Concurrency and Computation: Practice and Experience
- Vol. 28 (1), 70-94
- https://doi.org/10.1002/cpe.3617
Abstract
Data‐driven programming models such as many‐task computing (MTC) have been prevalent for running data‐intensive scientific applications. MTC applies over‐decomposition to enable distributed scheduling. To achieve extreme scalability, MTC proposes a fully distributed task scheduling architecture that employs as many schedulers as the compute nodes to make scheduling decisions. Achieving distributed load balancing and best exploiting data locality are two important goals for the best performance of distributed scheduling of data‐intensive applications. Our previous research proposed a data‐aware work‐stealing technique to optimize both load balancing and data locality by using both dedicated and shared task ready queues in each scheduler. Tasks were organized in queues based on the input data size and location. Distributed key‐value store was applied to manage task metadata. We implemented the technique in MATRIX, a distributed MTC task execution framework. In this work, we devise an analytical suboptimal upper bound of the proposed technique, compare MATRIX with other scheduling systems, and explore the scalability of the technique at extreme scales. Results show that the technique is not only scalable but can achieve performance within 15% of the suboptimal solution. Copyright © 2015 John Wiley & Sons, Ltd.Keywords
Funding Information
- U.S. Department of Energy (DE-FC02-06ER25750)
- National Science Foundation (NSF) (OCI-1054974, CNS-1042543)
This publication has 44 references indexed in Scilit:
- Impact of Over-Decomposition on Coordinated Checkpoint/Rollback ProtocolLecture Notes in Computer Science, 2012
- Software challenges in extreme scale systemsJournal of Physics: Conference Series, 2009
- Analysis of size interval task assignment policiesACM SIGMETRICS Performance Evaluation Review, 2008
- Distributed computing in practice: the Condor experienceConcurrency and Computation: Practice and Experience, 2005
- Job placement with unknown duration and no preemptionACM SIGMETRICS Performance Evaluation Review, 2001
- The implementation of the Cilk-5 multithreaded languageACM SIGPLAN Notices, 1998
- An approximate analysis of the join the shortest queue (JSQ) policyIEEE Transactions on Parallel and Distributed Systems, 1996
- Scalable Load Balancing Techniques for Parallel ComputersJournal of Parallel and Distributed Computing, 1994
- Strategies for dynamic load balancing on highly parallel computersIEEE Transactions on Parallel and Distributed Systems, 1993
- NP-complete scheduling problemsJournal of Computer and System Sciences, 1975