Caffeine
- 7 November 2016
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
With the recent advancement of multilayer convolutional neural networks (CNNs), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of computation-demanding CNNs, FPGA-based acceleration has emerged as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library that efficiently accelerates the entire CNN on FPGAs. First, we propose a unified convolutional matrix-multiplication representation for both computation-intensive convolutional layers and communication-intensive fully connected (FCN) layers. Second, we design Caffeine to maximize the utilization of the underlying FPGA's computing and bandwidth resources, with a key focus on bandwidth optimization through memory access reorganization, which was not studied in prior work. Moreover, we implement Caffeine in portable high-level synthesis and provide various hardware/software definable parameters for user configuration. Finally, we also integrate Caffeine into the industry-standard software deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing the VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 365 GOPS on a Xilinx KU060 FPGA and 636 GOPS on a Virtex7 690t FPGA, the best published results to the best of our knowledge. We achieve more than 100x speedup on FCN layers over previous FPGA accelerators. An end-to-end evaluation with the Caffe integration shows up to 7.3x performance and 43.5x energy gains over Caffe on a 12-core Xeon server, and 1.5x better energy efficiency than the GPU implementation on a medium-sized FPGA (KU060). Performance projections for a system with a high-end FPGA (Virtex7 690t) show even higher gains.
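The unified representation mentioned in the abstract casts a convolutional layer as a matrix multiplication by unrolling input patches into rows (the classic im2col transformation), so that both convolutional and fully connected layers reduce to the same matrix-multiply kernel. Below is a minimal NumPy sketch of this idea, not the paper's FPGA implementation; the `im2col` helper name and the single-channel, stride-1, no-padding setup are illustrative assumptions:

```python
import numpy as np

def im2col(x, k):
    """Unroll every k x k patch of a single-channel H x W input into one row."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((out_h * out_w, k * k))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

# A 3x3 convolution over a 5x5 input, expressed as one matrix multiplication:
x = np.arange(25, dtype=float).reshape(5, 5)  # input feature map
w = np.ones((3, 3))                           # convolution kernel
y = im2col(x, 3) @ w.ravel()                  # (9, 9) @ (9,) -> 9 output pixels
out = y.reshape(3, 3)                         # 3x3 output feature map
```

A fully connected layer is already a matrix-vector product, so after this transformation both layer types run on the same multiply-accumulate hardware, which is why a single accelerator design can serve the whole network.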