TABLA: A unified template-based framework for accelerating statistical machine learning
- 1 March 2016
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a single machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and exploit them to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as stochastic optimization problems: learning becomes solving an optimization problem with stochastic gradient descent, which minimizes an objective function over the training data. The gradient descent solver is fixed, while the objective function changes across learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. A developer therefore specifies a learning task by expressing only the gradient of its objective function in our high-level language. TABLA then automatically generates a synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform.
We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.
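The abstraction the abstract describes, a fixed stochastic gradient descent solver parameterized only by the task-specific gradient of the objective, can be illustrated with a short sketch. This is plain Python, not TABLA's actual template language; the function names and the least-squares example are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (assumed names, not TABLA's language) of the paper's
# split: the SGD solver is fixed, each learning task supplies only the
# gradient of its objective function.

def sgd(gradient, w, data, lr=0.01, epochs=10):
    """Fixed solver: step against the task-specific gradient
    for each training sample, for a number of epochs."""
    for _ in range(epochs):
        for x, y in data:
            g = gradient(w, x, y)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def linear_regression_gradient(w, x, y):
    """Task-specific part: gradient of a least-squares objective,
    one of the learning tasks a developer might express."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2 * err * xi for xi in x]

# Toy usage: recover w = [2, 1] from exactly realizable data.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)]
w = sgd(linear_regression_gradient, [0.0, 0.0], data, lr=0.1, epochs=50)
```

Swapping `linear_regression_gradient` for the gradient of, say, a logistic-regression or SVM objective changes the learning task without touching the solver, which is the commonality TABLA's templates exploit in hardware.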