Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
- 18 June 2016
- conference paper (proceedings also indexed as a journal issue)
- Published by Association for Computing Machinery (ACM) in ACM SIGARCH Computer Architecture News
- Vol. 44 (3), 367-379
- https://doi.org/10.1145/3007787.3001177
Abstract
Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption remains high because data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy. In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and by minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement by maximally utilizing the processing engine (PE) local storage, direct inter-PE communication, and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.
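The row-stationary idea described in the abstract decomposes a 2D convolution into 1D row convolutions: each PE keeps one filter row stationary in its local storage and convolves it with one input row, and the resulting partial-sum rows are accumulated across PEs to form each output row. The following is a minimal NumPy sketch of that decomposition only (the mapping of loop iterations to PEs), not the Eyeriss hardware or its energy model; all function names here are illustrative.

```python
import numpy as np

def conv1d_row(filt_row, ifmap_row):
    """One PE's primitive: slide a stationary filter row over one ifmap row,
    producing one row of partial sums (valid mode, stride 1)."""
    R = len(filt_row)
    W = len(ifmap_row)
    return np.array([np.dot(filt_row, ifmap_row[x:x + R])
                     for x in range(W - R + 1)])

def conv2d_row_stationary(filt, ifmap):
    """2D convolution via the row-stationary decomposition: the PE at grid
    position (i, j) holds filter row i, reads ifmap row (i + j), and its
    partial-sum row is accumulated vertically into output row j."""
    R, S = filt.shape
    H, W = ifmap.shape
    E, F = H - R + 1, W - S + 1   # output feature map size
    out = np.zeros((E, F))
    for j in range(E):            # output row index (one PE column)
        for i in range(R):        # filter row index (one PE row)
            out[j] += conv1d_row(filt[i], ifmap[i + j])
    return out
```

Because filter rows and ifmap rows are each read once per PE and reused across the sliding window inside `conv1d_row`, this loop structure mirrors the weight and activation reuse the RS dataflow exploits in hardware.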