Specializing FGPU for Persistent Deep Learning

30 June 2021

journal article
research article
Published by Association for Computing Machinery (ACM) in ACM Transactions on Reconfigurable Technology and Systems

Vol. 14 (2), 1-23
https://doi.org/10.1145/3457886

Abstract

Overlay architectures are a good way to enable fast development and debugging on FPGAs, at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus the overall productivity of FPGA solutions. This work tunes and specializes FGPU, an open-source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease of programming and generality of GPU programming while achieving high performance from specialization for the persistent deep learning domain. We also propose an easy method to specialize for other domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA in simulation, running persistent DL applications (RNN, GRU, LSTM) and non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20Ks, and 1–9.5× more DSPs than the baseline, but improves performance by 56–693× for PDL applications, with an average 23.1% degradation on non-PDL applications. We integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance/power and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.

Funding Information

National Science Foundation (1205721)

