Specializing FGPU for Persistent Deep Learning
- 30 June 2021
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Reconfigurable Technology and Systems
- Vol. 14 (2), 1-23
- https://doi.org/10.1145/3457886
Abstract
Overlay architectures are a good way to enable fast development and debug on FPGAs at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus overall productivity of FPGA solutions. This work tunes and specializes FGPU, an open-source, OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease of programming and generality of GPU programming while achieving high performance from specialization for the persistent deep learning domain. We also propose an easy method to specialize for other domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA in simulation, running persistent DL applications (RNN, GRU, LSTM) as well as non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20Ks, and 1–9.5× more DSPs than the baseline, but improves performance by 56–693× for PDL applications, with an average 23.1% degradation on non-PDL applications. We integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance and power, and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.
Funding Information
- National Science Foundation (1205721)