Scalpel

24 June 2017

journal article
conference paper
Published by Association for Computing Machinery (ACM) in ACM SIGARCH Computer Architecture News

Vol. 45 (2), 548-560
https://doi.org/10.1145/3140659.3080215

Abstract

As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and computation by removing redundant weights. However, we implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results. For many networks, the sparsity caused by weight pruning actually hurts overall performance despite large reductions in model size and required multiply-accumulate operations. Also, encoding the sparse format of pruned networks incurs additional storage overhead. To overcome these challenges, we propose Scalpel, which customizes DNN pruning to the underlying hardware by matching the pruned network structure to the data-parallel hardware organization. Scalpel consists of two techniques: SIMD-aware weight pruning and node pruning. For low-parallelism hardware (e.g., microcontroller), SIMD-aware weight pruning maintains weights in aligned fixed-size groups to fully utilize the SIMD units. For high-parallelism hardware (e.g., GPU), node pruning removes redundant nodes, not redundant weights, thereby reducing computation without sacrificing the dense matrix format. For hardware with moderate parallelism (e.g., desktop CPU), SIMD-aware weight pruning and node pruning are applied together. Across the microcontroller, CPU, and GPU, Scalpel achieves mean speedups of 3.54x, 2.61x, and 1.25x while reducing model sizes by 88%, 82%, and 53%, respectively. In comparison, traditional weight pruning achieves mean speedups of 1.90x, 1.06x, and 0.41x across the three platforms.
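The two techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the root-mean-square importance measure, and the thresholds are assumptions made here for illustration, and the abstract does not say how groups or nodes are actually selected. The first function zeroes whole aligned groups of weights so that the surviving non-zeros still fill a SIMD lane group; the second drops whole rows (nodes) so that the result stays a smaller dense matrix.

```python
import math


def rms(values):
    """Root-mean-square magnitude, used here as an assumed importance score."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def simd_aware_prune(weights, simd_width, threshold):
    """Zero whole aligned groups of `simd_width` weights in each row.

    A group is removed only as a unit, so every surviving group is a full,
    aligned block that a SIMD unit of that width can process at once.
    """
    pruned = []
    for row in weights:
        new_row = list(row)
        for start in range(0, len(row), simd_width):
            group = row[start:start + simd_width]  # last group may be shorter
            if rms(group) < threshold:
                new_row[start:start + len(group)] = [0.0] * len(group)
        pruned.append(new_row)
    return pruned


def node_prune(weights, threshold):
    """Drop whole rows (output nodes) whose RMS weight is below `threshold`.

    Returns the smaller dense matrix and the indices of the kept nodes.
    """
    kept = [i for i, row in enumerate(weights) if rms(row) >= threshold]
    return [list(weights[i]) for i in kept], kept


# Hypothetical 3-node layer with 4 inputs per node.
W = [
    [0.9, -0.8, 0.01, 0.02],
    [0.01, -0.02, 0.03, 0.01],
    [0.5, 0.4, -0.6, 0.7],
]

grouped = simd_aware_prune(W, simd_width=2, threshold=0.1)
dense, kept_nodes = node_prune(W, threshold=0.1)
print(grouped)
print(dense, kept_nodes)
```

In this toy layer, group pruning leaves zeros only in aligned pairs, while node pruning removes the second node entirely and keeps a 2x4 dense matrix that needs no sparse index encoding.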


Cited by 126 articles