Orchestra

Abstract
Synchronized Local-SGD (stochastic gradient descent) has become a popular strategy in distributed deep learning (DML) because it effectively reduces the frequency of model communication while ensuring global model convergence. However, it performs poorly and incurs excessive training time in heterogeneous environments due to differences in worker performance. In particular, in data-imbalanced scenarios, these differences between workers can aggravate low resource utilization and eventually produce stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from heterogeneity of computing resources or do not fully address environment dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with Orchestra, a novel adaptive load-balancing framework. The main idea of Orchestra is to improve resource utilization by balancing the load across workers according to both their performance and the imbalance in data volume. One of Orchestra's strongest features is adapting the number of local updates per worker at each epoch. To achieve this, we propose a distributed deep reinforcement learning-driven algorithm that lets each worker dynamically determine its number of local updates and its training data volume, subject to mini-batch time cost and resource constraints at each epoch. Our design significantly improves model convergence speed in DML compared with other state-of-the-art approaches.
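The synchronization pattern described above can be sketched as follows. This is a minimal toy illustration, not the Orchestra algorithm itself: three workers with unequal data shards each run their own number of local SGD steps on a least-squares objective before the server averages their models. The per-worker step counts (here fixed by hand) correspond to the knob that Orchestra's reinforcement-learning policy would adapt at every epoch; all function and variable names are hypothetical.

```python
import numpy as np

def local_sgd_round(w_global, local_steps, data, lr=0.1):
    """One synchronization round of Local-SGD on a toy least-squares problem.

    Each worker starts from the shared global model, runs its own number of
    local SGD steps on its private data shard, and the server then averages
    the resulting local models. `local_steps` holds a per-worker step count,
    playing the role of the quantity Orchestra adapts per epoch.
    """
    local_models = []
    for steps, (X, y) in zip(local_steps, data):
        w = w_global.copy()
        for _ in range(steps):
            # Gradient of 0.5 * ||Xw - y||^2 / n on this worker's shard.
            grad = X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        local_models.append(w)
    # Synchronization step: plain averaging of the workers' local models.
    return np.mean(local_models, axis=0)

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])

# Three workers with unequal shard sizes (data imbalance).
data = []
for n in (30, 60, 90):
    X = rng.normal(size=(n, 2))
    data.append((X, X @ w_true))

# A fast worker is assigned more local steps than slower ones.
local_steps = [8, 4, 2]
w = np.zeros(2)
for _ in range(50):
    w = local_sgd_round(w, local_steps, data)
```

After 50 rounds the averaged model `w` recovers `w_true` closely; giving more local steps to faster workers is what reduces the idle time that would otherwise accumulate at each synchronization barrier.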
