A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification
- 1 March 2012
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Architecture and Code Optimization
- Vol. 9 (1), 1-30
- https://doi.org/10.1145/2133382.2133388
Abstract
Applications that use learning and classification algorithms operate on large amounts of unstructured data, and have stringent performance constraints. For such applications, the performance of general-purpose processors scales poorly with data size because of their limited support for fine-grained parallelism and absence of software-managed caches. The large intermediate data in these applications also limits achievable performance on many-core processors such as GPUs. To accelerate such learning applications, we present a programmable accelerator that can execute multiple learning and classification algorithms. To architect such an accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding the max/min, and aggregation. Our proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses dynamic in-memory processing, where on-chip memory blocks perform the secondary reduction operations. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups, each with its own off-chip memory bank. These two features allow MAPLE to scale its performance with data size. We also present an Atom-based energy-efficient heterogeneous system with MAPLE as the accelerator that satisfies the application's performance requirements at a lower system power. This article describes the MAPLE architecture, explores its design space with a simulator, illustrates how to automatically map application kernels to the hardware, and presents its performance improvement and energy benefits over classic server-based implementations.
We implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz clock rate. With MAPLE connected to a 1.6 GHz dual-core Atom, we show an energy improvement of 38-84% over the Xeon server coupled to a 1.3 GHz 240-core Tesla GPU.
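The compute pattern the abstract identifies — a primary matrix/vector operation whose large intermediate result is immediately collapsed by a secondary reduction such as ranking, max/min, or aggregation — can be sketched in a few lines of NumPy. This is only an illustration of the pattern (the function and variable names here are hypothetical, not MAPLE's programming interface); on MAPLE, the primary operation would run across the PE grid and the reduction would happen in the on-chip memory blocks, so the intermediate scores never travel off-chip.

```python
import numpy as np

def classify_top_k(weights, sample, k=3):
    """Primary op: a matrix-vector product scoring every class.
    Secondary op: rank the scores and keep only the top k."""
    scores = weights @ sample            # large intermediate data
    return np.argsort(scores)[::-1][:k]  # reduction: array ranking

# Hypothetical workload: 1000 classes, 64 features per sample.
rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 64))
x = rng.standard_normal(64)
top = classify_top_k(W, x)  # indices of the 3 highest-scoring classes
```

Note that only `k` small indices survive the reduction; the 1000-element `scores` vector is exactly the kind of intermediate data that MAPLE's in-memory processing is designed to keep on-chip.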