Stash: Have your scratchpad and cache it too
- 13 June 2015
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 707-719
- https://doi.org/10.1145/2749469.2750374
Abstract
Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems employ specialized memories (e.g., scratchpads and FIFOs) for better efficiency for targeted data. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system with CPUs and GPUs with scratchpads and caches. We introduce a new memory organization, stash, that combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and energy than a cache and a scratchpad, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks, which exploit new use cases (e.g., reuse across GPU compute kernels), compared to scratchpads and caches, the stash reduces execution cycles by an average of 27% and 13% respectively and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the new features of the stash, compared to scratchpads and caches, the stash reduces cycles by 10% and 12% on average (max 22% and 31%) respectively, and energy by 16% and 32% on average (max 30% and 51%).
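The abstract's contrast between the three organizations can be illustrated with a toy software model (a sketch only; the class names, direct-mapped layout, and lazy-fetch policy below are illustrative assumptions, not the paper's hardware design). A cache pays a tag compare on every access; a stash is directly indexed like a scratchpad but, because each slot maps to a global address, data can be fetched implicitly rather than by explicit copy-in.

```python
# Toy model (illustrative, not the paper's design): cache vs. stash access.

MEMORY = {addr: addr * 10 for addr in range(64)}  # toy global backing memory

def read_memory(addr):
    return MEMORY[addr]

class Cache:
    """Direct-mapped cache: every access pays address decode + tag compare."""
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # tag array: extra state per line
        self.data = [0] * num_lines

    def load(self, global_addr):
        idx = global_addr % self.num_lines
        tag = global_addr // self.num_lines
        if self.tags[idx] != tag:        # miss: implicitly fetch from memory
            self.tags[idx] = tag
            self.data[idx] = read_memory(global_addr)
        return self.data[idx]

class Stash:
    """Stash: directly indexed like a scratchpad (no tags), but each slot
    maps to a global address, so data is globally visible and can be
    fetched implicitly on first use instead of being copied in by hand."""
    def __init__(self, size, global_base):
        self.data = [None] * size
        self.global_base = global_base   # slot i holds global_base + i

    def load(self, idx):
        if self.data[idx] is None:       # implicit (lazy) data movement
            self.data[idx] = read_memory(self.global_base + idx)
        return self.data[idx]
```

A plain scratchpad, by contrast, would require the program to issue explicit loads and stores to move each element between the global space and the local array before indexing it.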
Funding Information
- Center for Future Architectures Research
- Qualcomm Innovation Fellowship
- National Science Foundation (CCF-1018796, CCF-1302641)
- Illinois Intel Parallelism Center