A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification
- 1 March 2012
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Architecture and Code Optimization
- Vol. 9 (1), 1-30
- https://doi.org/10.1145/2133382.2133388
Abstract
Applications that use learning and classification algorithms operate on large amounts of unstructured data, and have stringent performance constraints. For such applications, the performance of general-purpose processors scales poorly with data size because of their limited support for fine-grained parallelism and absence of software-managed caches. The large intermediate data in these applications also limits achievable performance on many-core processors such as GPUs. To accelerate such learning applications, we present a programmable accelerator that can execute multiple learning and classification algorithms. To architect such an accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding the max/min, and aggregation. Our proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses dynamic in-memory processing, where on-chip memory blocks perform the secondary reduction operations. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups, each with its own off-chip memory bank. These two features allow MAPLE to scale its performance with data size. We also present an Atom-based energy-efficient heterogeneous system with MAPLE as the accelerator that satisfies the application's performance requirements at a lower system power. This article describes the MAPLE architecture, explores its design space with a simulator, illustrates how to automatically map application kernels to the hardware, and presents its performance improvement and energy benefits over classic server-based implementations.
We implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz clock rate. With MAPLE connected to a 1.6 GHz dual-core Atom, we show an energy improvement of 38-84% over the Xeon server coupled to a 1.3 GHz 240-core Tesla GPU.
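The compute pattern the abstract identifies — a primary matrix/vector operation whose large intermediate result is immediately collapsed by a secondary reduction such as ranking, max/min, or aggregation — can be sketched in a few lines of NumPy. This is only an illustration of the pattern (the function and variable names here are hypothetical, not MAPLE's programming interface); on MAPLE, the primary operation would run across the PE grid and the reduction would happen in the on-chip memory blocks, so the intermediate scores never travel off-chip.

```python
import numpy as np

def classify_top_k(weights, sample, k=3):
    """Primary op: a matrix-vector product scoring every class.
    Secondary op: rank the scores and keep only the top k."""
    scores = weights @ sample            # large intermediate data
    return np.argsort(scores)[::-1][:k]  # reduction: array ranking

# Hypothetical workload: 1000 classes, 64 features per sample.
rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 64))
x = rng.standard_normal(64)
top = classify_top_k(W, x)  # indices of the 3 highest-scoring classes
```

Note that only `k` small indices survive the reduction; the 1000-element `scores` vector is exactly the kind of intermediate data that MAPLE's in-memory processing is designed to keep on-chip.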