TABLA: A unified template-based framework for accelerating statistical machine learning
- 1 March 2016
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a single machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and exploit them to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as stochastic optimization problems: learning becomes solving an optimization problem with stochastic gradient descent, which minimizes an objective function over the training data. The gradient descent solver is fixed, while the objective function changes across learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. A developer therefore specifies a learning task by expressing only the gradient of its objective function in our high-level language. TABLA then automatically generates a synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform.
We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.
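The abstraction the abstract describes, a fixed stochastic gradient descent solver parameterized only by the task-specific gradient of the objective, can be illustrated with a short sketch. This is plain Python, not TABLA's actual template language; the function names and the least-squares example are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (assumed names, not TABLA's language) of the paper's
# split: the SGD solver is fixed, each learning task supplies only the
# gradient of its objective function.

def sgd(gradient, w, data, lr=0.01, epochs=10):
    """Fixed solver: step against the task-specific gradient
    for each training sample, for a number of epochs."""
    for _ in range(epochs):
        for x, y in data:
            g = gradient(w, x, y)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def linear_regression_gradient(w, x, y):
    """Task-specific part: gradient of a least-squares objective,
    one of the learning tasks a developer might express."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2 * err * xi for xi in x]

# Toy usage: recover w = [2, 1] from exactly realizable data.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)]
w = sgd(linear_regression_gradient, [0.0, 0.0], data, lr=0.1, epochs=50)
```

Swapping `linear_regression_gradient` for the gradient of, say, a logistic-regression or SVM objective changes the learning task without touching the solver, which is the commonality TABLA's templates exploit in hardware.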