Stash: Have your scratchpad and cache it too
- 13 June 2015
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 707-719
- https://doi.org/10.1145/2749469.2750374
Abstract
Heterogeneous systems employ specialization for energy efficiency. Since data movement is expected to be a dominant consumer of energy, these systems employ specialized memories (e.g., scratchpads and FIFOs) for better efficiency for targeted data. These memory structures, however, tend to exist in local address spaces, incurring significant performance and energy penalties due to inefficient data movement between the global and private spaces. We propose an efficient heterogeneous memory system where specialized memory components are tightly coupled in a unified and coherent address space. This paper applies these ideas to a system with CPUs and GPUs with scratchpads and caches. We introduce a new memory organization, stash, that combines the benefits of caches and scratchpads without incurring their downsides. Like a scratchpad, the stash is directly addressed (without tags and TLB accesses) and provides compact storage. Like a cache, the stash is globally addressable and visible, providing implicit data movement and increased data reuse. We show that the stash provides better performance and energy than a cache and a scratchpad, while enabling new use cases for heterogeneous systems. For 4 microbenchmarks, which exploit new use cases (e.g., reuse across GPU compute kernels), compared to scratchpads and caches, the stash reduces execution cycles by an average of 27% and 13% respectively and energy by an average of 53% and 35%. For 7 current GPU applications, which are not designed to exploit the new features of the stash, compared to scratchpads and caches, the stash reduces cycles by 10% and 12% on average (max 22% and 31%) respectively, and energy by 16% and 32% on average (max 30% and 51%).
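The abstract's contrast between the three organizations can be illustrated with a toy software model (a sketch only; the class names, direct-mapped layout, and lazy-fetch policy below are illustrative assumptions, not the paper's hardware design). A cache pays a tag compare on every access; a stash is directly indexed like a scratchpad but, because each slot maps to a global address, data can be fetched implicitly rather than by explicit copy-in.

```python
# Toy model (illustrative, not the paper's design): cache vs. stash access.

MEMORY = {addr: addr * 10 for addr in range(64)}  # toy global backing memory

def read_memory(addr):
    return MEMORY[addr]

class Cache:
    """Direct-mapped cache: every access pays address decode + tag compare."""
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines   # tag array: extra state per line
        self.data = [0] * num_lines

    def load(self, global_addr):
        idx = global_addr % self.num_lines
        tag = global_addr // self.num_lines
        if self.tags[idx] != tag:        # miss: implicitly fetch from memory
            self.tags[idx] = tag
            self.data[idx] = read_memory(global_addr)
        return self.data[idx]

class Stash:
    """Stash: directly indexed like a scratchpad (no tags), but each slot
    maps to a global address, so data is globally visible and can be
    fetched implicitly on first use instead of being copied in by hand."""
    def __init__(self, size, global_base):
        self.data = [None] * size
        self.global_base = global_base   # slot i holds global_base + i

    def load(self, idx):
        if self.data[idx] is None:       # implicit (lazy) data movement
            self.data[idx] = read_memory(self.global_base + idx)
        return self.data[idx]
```

A plain scratchpad, by contrast, would require the program to issue explicit loads and stores to move each element between the global space and the local array before indexing it.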
Funding Information
- Center for Future Architectures Research
- Qualcomm Innovation Fellowship
- National Science Foundation (CCF-1018796, CCF-1302641)
- Illinois Intel Parallelism Center