heSRPT

5 March 2021

journal article
research article
Published by Association for Computing Machinery (ACM) in ACM SIGMETRICS Performance Evaluation Review

Vol. 48 (3), 35-36
https://doi.org/10.1145/3453953.3453960

Abstract

Modern data centers serve workloads that can exploit parallelism. When a job parallelizes across multiple servers, it completes more quickly. However, it is unclear how to share a limited number of servers among many parallelizable jobs. In this paper, we consider a typical scenario in which a data center composed of N servers is tasked with completing a set of M parallelizable jobs. Typically, M is much smaller than N. In our scenario, each job consists of some amount of inherent work, which we refer to as the job's size. We assume that job sizes are known to the system up front and that each job can utilize any number of servers at any moment in time. These assumptions are reasonable for many parallelizable workloads, such as training neural networks using TensorFlow [2]. Our goal in this paper is to allocate servers to jobs so as to minimize the mean slowdown across all jobs, where the slowdown of a job is the job's completion time divided by its running time if given exclusive access to all N servers. Slowdown measures how much a job was delayed by interference from other jobs in the system, and it is often the metric of interest in the theoretical parallel scheduling literature (where it is also called stretch) as well as in the HPC community (where it is called expansion factor).
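
The slowdown metric defined in the abstract can be illustrated with a minimal Python sketch. The function name `mean_slowdown`, its parameters, and the `speedup` callable are hypothetical and introduced here only for illustration. The abstract does not specify how running time depends on the number of servers, so the example assumes a perfectly linear speedup (a job on k servers runs k times faster); this is an assumption of the sketch, not a claim about the paper's model.

```python
from typing import Callable, Sequence


def mean_slowdown(
    completion_times: Sequence[float],
    sizes: Sequence[float],
    n_servers: int,
    speedup: Callable[[int], float],
) -> float:
    """Mean slowdown across jobs.

    The slowdown of a job is its completion time divided by the time it
    would take if it had exclusive access to all n_servers servers.
    `speedup(k)` is an assumed model of how fast a job runs on k servers.
    """
    if len(completion_times) != len(sizes) or not sizes:
        raise ValueError("need one completion time per job, and at least one job")
    total = 0.0
    for finish, size in zip(completion_times, sizes):
        # Running time with exclusive access to every server.
        time_alone = size / speedup(n_servers)
        total += finish / time_alone
    return total / len(sizes)


# Assumed linear speedup: k servers process work k times faster.
def linear(k: int) -> float:
    return float(k)


# Two jobs of size 1 and 2 on N = 4 servers, run one after the other,
# smaller job first, each using all 4 servers while it runs.
# Job 1 finishes at 1/4 = 0.25; job 2 finishes at 0.25 + 2/4 = 0.75.
result = mean_slowdown([0.25, 0.75], [1.0, 2.0], 4, linear)
print(result)  # slowdowns are 1.0 and 1.5, so the mean is 1.25
```

In this toy schedule the first job is not delayed at all (slowdown 1), while the second job waits behind it (slowdown 1.5). Other schedules, or a different assumed speedup function, would give different values.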

This publication has 3 references indexed in Scilit:

A Model-Based Approach to Streamlining Distributed Training for Asynchronous SGD
Published by Institute of Electrical and Electronics Engineers (IEEE), 2018
Nearly insensitive bounds on SMART scheduling
ACM SIGMETRICS Performance Evaluation Review, 2005
Technical Note—A New Proof of the Optimality of the Shortest Remaining Processing Time Discipline
Operations Research, 1978