NASPipe: high performance and reproducible pipeline parallel supernet training via causal synchronous parallelism
- 22 February 2022
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
Abstract
Supernet training, a prevalent and important paradigm in Neural Architecture Search, embeds the whole DNN architecture search space into one monolithic supernet, iteratively activates a subset of the supernet (i.e., a subnet) to fit each batch of data, and searches for a high-quality subnet that meets specific requirements. Although training subnets in parallel on multiple GPUs is desirable for acceleration, there inherently exists a race hazard: concurrent subnets may access the same DNN layers. Existing systems can neither efficiently parallelize subnets' training executions nor resolve the race hazard deterministically, leading to unreproducible training procedures and potentially non-trivial accuracy loss. We present NASPipe, the first high-performance and reproducible distributed supernet training system, built on a causal synchronous parallel (CSP) pipeline scheduling abstraction: NASPipe partitions a supernet across GPUs and concurrently executes multiple generated sub-tasks (subnets) in a pipelined manner; meanwhile, it oversees the correlations between the subnets and deterministically resolves any causal dependency caused by subnets' layer sharing. To obtain high performance, NASPipe's CSP scheduler exploits the fact that the larger a supernet spans, the fewer dependencies manifest between chronologically close subnets; therefore, it aggressively schedules subnets with larger chronological orders into execution, but only if they are not causally dependent on unfinished precedent subnets. Moreover, to relieve the excessive GPU memory burden of holding the whole supernet's parameters, NASPipe uses a context switch technique that stashes the whole supernet in CPU memory, precisely predicts the subnets' schedule, and pre-fetches/evicts a subnet before/after its execution. The evaluation shows that NASPipe is the only system that retains supernet training reproducibility, while achieving comparable or even higher performance (up to 7.8X) than three recent pipeline training systems (e.g., GPipe).
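To make the CSP scheduling idea from the abstract concrete, below is a minimal conceptual sketch (not NASPipe's actual implementation): a subnet may be scheduled ahead of chronologically earlier subnets only if it shares no layers with any unfinished predecessor. The subnet structure and layer names are hypothetical.

```python
# Conceptual sketch of causal synchronous parallel (CSP) scheduling:
# a subnet is ready to run only if every earlier, still-unfinished subnet
# is layer-disjoint with it (i.e., no causal dependency via layer sharing).

def shares_layers(a, b):
    """Two subnets are causally dependent if they activate a common layer."""
    return bool(set(a["layers"]) & set(b["layers"]))

def csp_ready(subnets, finished):
    """Return indices of subnets that can be scheduled now, possibly
    out of chronological order, without violating causal dependencies."""
    ready = []
    for i, sub in enumerate(subnets):
        if i in finished:
            continue
        if all(j in finished or not shares_layers(subnets[j], sub)
               for j in range(i)):
            ready.append(i)
    return ready

# Hypothetical example: subnet 1 reuses layer "conv2" of subnet 0, so it must
# wait for subnet 0; subnet 2 touches disjoint layers and may run ahead of it.
subnets = [
    {"layers": ["conv1", "conv2"]},
    {"layers": ["conv2", "fc1"]},
    {"layers": ["conv3", "fc2"]},
]
print(csp_ready(subnets, finished=set()))  # -> [0, 2]
print(csp_ready(subnets, finished={0}))    # -> [1, 2]
```

In this sketch, deterministic resolution of layer-sharing conflicts follows from always deferring to the chronologically earlier subnet, which mirrors the reproducibility guarantee the abstract describes.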
Funding Information
- National Natural Science Foundation of China (61802358)
- HK RGC GRF (17202318)
- HK RGC GRF (17207117)