ACM Transactions on Reconfigurable Technology and Systems

Journal Information
ISSN / EISSN : 1936-7406 / 1936-7414
Total articles ≅ 377

Latest articles in this journal

Enrico Reggiani, Emanuele Del Sozzo, Davide Conficconi, Giuseppe Natale, Carlo Moroni, Marco D. Santambrogio
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-33;

Stencil-based algorithms are a relevant class of computational kernels in high-performance systems, as they appear in a plethora of fields, from image processing to seismic simulations, from numerical methods to physical modeling. Among the various incarnations of stencil-based computations, Iterative Stencil Loops (ISLs) and Convolutional Neural Networks (CNNs) represent two well-known examples of kernels belonging to the stencil class. Indeed, ISLs apply the same stencil several times until convergence, while CNN layers leverage stencils to extract features from an image. The computationally intensive essence of ISLs, CNNs, and in general stencil-based workloads, requires solutions able to produce efficient implementations in terms of throughput and power efficiency. In this context, FPGAs are ideal candidates for such workloads, as they allow designing architectures tailored to the regular computational pattern of stencils. Moreover, the ever-growing need for performance enhancement leads FPGA-based architectures to scale to multiple devices to benefit from a distributed acceleration. For this reason, we propose a library of HDL components to effectively compute ISLs and CNN inference on FPGA, along with a scalable multi-FPGA architecture, based on custom PCB interconnects. Our solution eases the design flow and guarantees both scalability and performance competitive with state-of-the-art works.
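As an illustration of the "same stencil applied until convergence" pattern described in this abstract, here is a minimal software sketch of an iterative stencil loop. It is not the authors' HDL library: the 4-point averaging (Jacobi) stencil, the function names, the tolerance, and the toy grid are all assumptions chosen for brevity.

```python
def jacobi_step(grid):
    """Apply a 4-point averaging stencil to every interior cell (assumed stencil)."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary cells are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def iterative_stencil_loop(grid, tol=1e-6, max_iters=10_000):
    """Repeat the stencil until the largest per-cell change drops below tol."""
    for iteration in range(1, max_iters + 1):
        new = jacobi_step(grid)
        delta = max(abs(new[i][j] - grid[i][j])
                    for i in range(len(grid))
                    for j in range(len(grid[0])))
        grid = new
        if delta < tol:
            return grid, iteration
    return grid, max_iters


def make_grid(n=5, hot=1.0):
    """Hypothetical test input: top boundary held at `hot`, everything else 0."""
    grid = [[0.0] * n for _ in range(n)]
    grid[0] = [hot] * n
    return grid


if __name__ == "__main__":
    result, iters = iterative_stencil_loop(make_grid())
    print(f"converged after {iters} iterations; centre = {result[2][2]:.4f}")
```

Each output cell depends only on a fixed neighbourhood of the previous grid, which is the regularity that makes this class of kernels a natural fit for pipelined hardware.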
Endri Taka, Konstantinos Maragos, George Lentaris, Dimitrios Soudris
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-30;

In the current work, we study the process variability of logic, interconnect, and arithmetic/DSP resources in commercial 16-nm FPGAs. We create multiple, soft-macro sensors for each distinct resource under evaluation, and we deploy them across the FPGA fabric to measure intra-die variation, as well as across multiple FPGAs to measure inter-die variation. The derived results are used to create device-signature variability maps characterizing the distribution of variability across the die. Our study includes decoupling of variability into systematic and stochastic parts, exploration of variability under various voltage and temperature conditions, and correlation analysis between the variability maps of the different resources. Furthermore, we scrutinize the impact of variability on the performance of actual test circuits and correlate the retrieved results with the sensor-based maps. Our experimental results on four Zynq XCZU7EV FPGAs showed significant intra- and inter-die variability, up to 7.8% and 8.9%, respectively, with a small increase under certain operating conditions. The correlation analysis demonstrated a strong correlation between the logic and arithmetic resources, whereas the interconnects showed a slightly weaker correlation in specific devices. Finally, a relatively moderate correlation was calculated between the variability maps and performance of test circuits due to their dissimilar operating behavior versus our sensors.
, Hannah Szentimrey, Ahmed Shamli, Timothy Martin, Gary Gréwal, Shawki Areibi
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-28;

Accurately and efficiently estimating the routability of a circuit based on its placement is one of the most challenging tasks in the Field Programmable Gate Array (FPGA) flow. In this article, we present a novel, deep learning framework based on a Convolutional Neural Network (CNN) model for predicting the routability of a placement. Since the performance of the CNN model is strongly dependent on the hyper-parameters selected for the model, we perform an exhaustive parameter tuning that significantly improves the model’s performance and we also avoid overfitting the model. We also incorporate the deep learning model into a state-of-the-art placement tool and show how the model can be used to (1) avoid costly, but futile, place-and-route iterations, and (2) improve the placer’s ability to produce routable placements for hard-to-route circuits using feedback based on routability estimates generated by the proposed model. The model is trained and evaluated using over 26K placement images derived from 372 benchmarks supplied by Xilinx Inc. We also explore several opportunities to further improve the reliability of the predictions made by the proposed DLRoute technique by splitting the model into two separate deep learning models for (a) global and (b) detailed placement during the optimization process. Experimental results show that the proposed framework achieves a routability prediction accuracy of 97% while exhibiting runtimes of only a few milliseconds.
Arif Sasongko, I. M. Narendra Kumara, Arief Wicaksana, Frédéric Rousseau, Olivier Muller
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-25;

The confidentiality and integrity of a stream have become one of the biggest issues in telecommunication. The best available algorithm handling the confidentiality of a data stream is the symmetric key block cipher combined with a chaining mode of operation such as cipher block chaining (CBC) or counter mode (CTR). This scheme is difficult to accelerate using hardware when multiple streams coexist. This is caused by the computation time requirement and mainly by management of the streams. In most accelerators, computation is treated at the block-level rather than as a stream, making the management of multiple streams complex. This article presents a solution combining CBC and CTR modes of operation with a hardware context switching. The hardware context switching allows the accelerator to treat the data as a stream. Each stream can have different parameters: key, initialization value, state of counter. Stream switching was managed by the hardware context switching mechanism. A high-level synthesis tool was used to generate the context switching circuit. The scheme was tested on three cryptographic algorithms: AES, DES, and BC3. The hardware context switching allowed the software to manage multiple streams easily, efficiently, and rapidly. The software was freed of the task of managing the stream state. Compared to the original algorithm, about 18%–38% additional logic elements were required to implement the CBC or CTR mode and the additional circuits to support context switching. Using this method, the performance overhead when treating multiple streams was low, and the performance was comparable to that of existing hardware accelerators not supporting multiple streams.
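The idea of keeping a per-stream context (key, initialization value, counter state) inside the accelerator can be sketched in software as follows. This is only an illustrative model of CTR mode with stream contexts, not the article's hardware design. The class and function names are hypothetical, and `toy_keyed_block` is a SHA-256-based placeholder standing in for a real block cipher such as AES; it is not secure and is used only because CTR mode needs nothing more than a keyed forward function.

```python
import hashlib
from dataclasses import dataclass

BLOCK = 16  # block size in bytes (assumed, matching AES)


def toy_keyed_block(key: bytes, block: bytes) -> bytes:
    """Placeholder for a block cipher's encrypt function; NOT a real cipher."""
    return hashlib.sha256(key + block).digest()[:BLOCK]


@dataclass
class StreamContext:
    """State the engine keeps per stream so the caller does not have to."""
    key: bytes
    nonce: bytes      # 8-byte initialization value
    counter: int = 0  # position within the stream


class CtrEngine:
    """Toy multi-stream CTR engine: each call selects a context by stream id."""

    def __init__(self):
        self.contexts = {}

    def open_stream(self, sid, key: bytes, nonce: bytes) -> None:
        self.contexts[sid] = StreamContext(key, nonce)

    def process_block(self, sid, data: bytes) -> bytes:
        ctx = self.contexts[sid]  # the "context switch": load this stream's state
        counter_block = ctx.nonce + ctx.counter.to_bytes(8, "big")
        keystream = toy_keyed_block(ctx.key, counter_block)
        ctx.counter += 1          # state is saved back for the next call
        return bytes(d ^ k for d, k in zip(data, keystream))


if __name__ == "__main__":
    engine = CtrEngine()
    engine.open_stream("a", b"key-a", b"nonce--a")
    engine.open_stream("b", b"key-b", b"nonce--b")
    # Blocks from two streams arrive interleaved; the caller passes only the id.
    for sid, block in [("a", b"A" * 16), ("b", b"C" * 16), ("a", b"B" * 16)]:
        print(sid, engine.process_block(sid, block).hex())
```

Because CTR encryption and decryption are the same XOR operation, a second engine opened with the same key and nonce recovers the plaintext, and interleaving blocks from different streams does not change any stream's output.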
Ryota Yasudo, José G. F. Coutinho, Ana-Lucia Varbanescu, Wayne Luk, Hideharu Amano, Tobias Becker, Ce Guo
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-21;

Next-generation high-performance computing platforms will handle extreme data- and compute-intensive problems that are intractable with today’s technology. A promising path in achieving the next leap in high-performance computing is to embrace heterogeneity and specialised computing in the form of reconfigurable accelerators such as FPGAs, which have been shown to speed up compute-intensive tasks with reduced power consumption. However, assessing the feasibility of large-scale heterogeneous systems requires fast and accurate performance prediction. This article proposes Performance Estimation for Reconfigurable Kernels and Systems (PERKS), a novel performance estimation framework for reconfigurable dataflow platforms. PERKS makes use of an analytical model with machine and application parameters for predicting the performance of multi-accelerator systems and detecting their bottlenecks. Model calibration is automatic, making the model flexible and usable for different machine configurations and applications, including hypothetical ones. Our experimental results show that PERKS can predict the performance of current workloads on reconfigurable dataflow platforms with an accuracy above 91%. The results also illustrate how the modelling scales to large workloads, and how performance impact of architectural features can be estimated in seconds.
Xavier Martorell, Carlos Alvarez, Christos-Savvas Bouganis, Ioannis Sourdis
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-1;

George Provelengios, Daniel Holcomb, Russell Tessier
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-24;

Recent research has exposed a number of security issues related to the use of FPGAs in embedded system and cloud computing environments. Circuits that deliberately waste power can be carefully crafted by a malicious cloud FPGA user and deployed to cause denial-of-service and fault injection attacks. The main defense strategy used by FPGA cloud services involves checking user-submitted designs for circuit structures that are known to aggressively consume power. Unfortunately, this approach is limited by an attacker’s ability to conceive new designs that defeat existing checkers. In this work, our contributions are twofold. We evaluate a variety of circuit power wasting techniques that typically are not flagged by design rule checks imposed by FPGA cloud computing vendors. The efficiencies of five power wasting circuits, including our new design, are evaluated in terms of power consumed per logic resource. We then show that the source of voltage attacks based on power wasters can be identified. Our monitoring approach localizes the attack and suppresses the clock signal for the target region within 21 μs, which is fast enough to stop an attack before it causes a board reset. All experiments are performed using a state-of-the-art Intel Stratix 10 FPGA.
Shenghsun Cho, Mrunal Patel, Michael Ferdman, Peter Milder
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-18;

Software verification is an important stage of the software development process, particularly for mission-critical systems. As the traditional methodology of using unit tests falls short of verifying complex software, developers are increasingly relying on formal verification methods, such as explicit state model checking, to automatically verify that the software functions properly. However, due to the ever-increasing complexity of software designs, model checking cannot be performed in a reasonable amount of time when running on general-purpose cores, leading to the exploration of hardware-accelerated model checking. FPGAs have been demonstrated to be promising verification accelerators, exhibiting nearly three orders of magnitude speedup over software. Unfortunately, the “FPGA programmability wall,” particularly the long synthesis and place-and-route times, blocks the general adoption of FPGAs for model checking. To address this problem, we designed a runtime-programmable pipeline specifically for model checkers on FPGAs to minimize the “preparation time” before a model can be checked. Our design of the successor state generator and the state validator modules enables FPGA-acceleration of model checking without incurring the time-consuming FPGA implementation stages, reducing the preparation time before checking a model from hours to less than a minute, while incurring only a 26% execution time overhead compared to model-specific implementations.
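For readers unfamiliar with explicit state model checking, the roles of a successor state generator and a state validator can be sketched in a few lines of software. This is a generic breadth-first search, not the article's FPGA pipeline. The function names and the toy counter model are assumptions made for illustration.

```python
from collections import deque


def check_model(initial, successors, is_valid):
    """Breadth-first explicit-state search.

    successors(state) plays the role of a successor state generator and
    is_valid(state) the role of a state validator. Returns a tuple
    (ok, counterexample_path_or_None, number_of_states_discovered).
    """
    parent = {initial: None}  # visited set, also used to rebuild a path
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not is_valid(state):
            path = []
            while state is not None:  # walk back to the initial state
                path.append(state)
                state = parent[state]
            return False, path[::-1], len(parent)
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return True, None, len(parent)


def toy_successors(state):
    """Hypothetical model: a counter modulo 6 that may step by 1 or by 2."""
    return [(state + 1) % 6, (state + 2) % 6]


if __name__ == "__main__":
    ok, path, seen = check_model(0, toy_successors, lambda s: s != 4)
    print("property holds" if ok else f"violation reached via {path}")
```

In this sketch, changing the model only means passing different `successors` and `is_valid` functions, while the search loop stays fixed. That loosely mirrors the motivation for a runtime-programmable checker, although the abstract does not describe the hardware at this level of detail.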
Rui Ma, Jia-Ching Hsu, Tian Tan, Eriko Nurvitadhi, David Sheffield, Rob Pelt, Martin Langhammer, Jaewoong Sim, Aravind Dasu, Derek Chiou
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-23;

Overlay architectures are a good way to enable fast development and debug on FPGAs at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus overall productivity of FPGA solutions. This work tunes and specializes FGPU, an open source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease-of-programming and generality of GPU programming while achieving high performance from specialization for the persistent deep learning domain. We also propose an easy method to specialize for other domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA in simulation running persistent DL applications (RNN, GRU, LSTM), and non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20ks, and 1–9.5× more DSPs than baseline, but improves performance by 56–693× for PDL applications with an average 23.1% degradation on non-PDL applications. We integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance/power and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.
Adriaan Peetermans, Vladimir Rožić,
ACM Transactions on Reconfigurable Technology and Systems, Volume 14, pp 1-20;

True Random Number Generators (TRNGs) are indispensable in modern cryptosystems. Unfortunately, to guarantee high entropy of the generated numbers, many TRNG designs require a complex implementation procedure, often involving manual placement and routing. In this work, we introduce, analyse, and compare three dynamic calibration mechanisms for the COherent Sampling ring Oscillator based TRNG: GateVar, WireVar, and LUTVar, enabling easy integration of the entropy source into complex systems. The TRNG setup procedure automatically selects a configuration that guarantees the security requirements. In the experiments, we show that two out of the three proposed mechanisms are capable of assuring correct TRNG operation even when an automatic placement is carried out and when the design is ported to another Field-Programmable Gate Array (FPGA) family. We generated random bits on both a Xilinx Spartan 7 and a Microsemi SmartFusion2 implementation that, without post processing, passed the AIS-31 statistical tests at a throughput of 4.65 Mbit/s and 1.47 Mbit/s, respectively.