Look-Up Table based Energy Efficient Processing in Cache Support for Neural Network Acceleration

1 October 2020

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 88-101
https://doi.org/10.1109/micro50266.2020.00020

Abstract

This paper presents a Look-Up Table (LUT) based Processing-In-Memory (PIM) technique with the potential for running neural network inference tasks. We implement a bitline-computing-free technique that avoids frequent bitline accesses to the cache sub-arrays, thereby considerably reducing the memory access energy overhead. The LUT, in conjunction with the compute engines, enables sub-array-level parallelism while executing, through data lookup, complex operations that would otherwise require multiple cycles. Sub-array-level parallelism and systolic input data flow confine data movement to the SRAM slice. Our proposed LUT-based PIM methodology exploits substantial parallelism using look-up tables without altering the memory structure or organization; that is, it preserves the bit-cells and peripherals of the existing monolithic SRAM arrays. Our solution achieves 1.72x higher performance and 3.14x lower energy compared to a state-of-the-art processing-in-cache solution. The sub-array-level design modifications that incorporate the LUT along with the compute engines increase the overall cache area by 5.6%. We achieve a 3.97x speedup with respect to a neural network systolic accelerator of similar area. The reconfigurable nature of the compute engines enables various neural network operations, thereby supporting sequential networks (RNNs) and transformer models. Our quantitative analysis demonstrates 101x and 3x faster execution and 91x and 11x better energy efficiency than a CPU and a GPU, respectively, when running the transformer model BERT-Base.
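
To make the idea of "executing complex operations through data lookup" concrete, the following is a minimal software sketch, not the paper's hardware design. It assumes, purely for illustration, that an unsigned 8-bit multiplication is split into four 4-bit partial products, each read from a small precomputed table and then combined with shifts and adds. The names `LUT4`, `lut_mul8`, and `lut_dot`, and the 4-bit table granularity, are hypothetical and do not come from the abstract.

```python
# Hypothetical illustration of lookup-based multiplication.
# Assumption: 4-bit operand slices and a 16 x 16 product table; the paper's
# actual table size and operand decomposition are not stated in the abstract.

# Precomputed table of every 4-bit x 4-bit product.
LUT4 = [[a * b for b in range(16)] for a in range(16)]


def lut_mul8(x: int, y: int) -> int:
    """Multiply two unsigned 8-bit values using table lookups, shifts, and adds."""
    if not (0 <= x < 256 and 0 <= y < 256):
        raise ValueError("operands must be unsigned 8-bit values")
    # Split each operand into a high and a low 4-bit nibble.
    xh, xl = x >> 4, x & 0xF
    yh, yl = y >> 4, y & 0xF
    # x * y = (xh*yh << 8) + ((xh*yl + xl*yh) << 4) + xl*yl,
    # where each nibble product is read from the table instead of computed.
    return (
        (LUT4[xh][yh] << 8)
        + ((LUT4[xh][yl] + LUT4[xl][yh]) << 4)
        + LUT4[xl][yl]
    )


def lut_dot(weights, activations):
    """Dot product of two equal-length sequences of unsigned 8-bit values."""
    if len(weights) != len(activations):
        raise ValueError("sequences must have the same length")
    return sum(lut_mul8(w, a) for w, a in zip(weights, activations))
```

In this sketch the table is shared by every multiplication, so the only per-operand work is indexing, shifting, and adding. That is the general property a lookup-based compute engine relies on, although the hardware trade-offs reported in the abstract cannot be inferred from a software model like this one.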


Cited by 24 articles