REMOC

Abstract
The on-chip memories of GPUs, including the register file, shared memory, and L1 cache, provide high-bandwidth, low-latency storage for temporary data. The effective capacity of the L1 cache can be increased by using as cache lines the registers and shared memory that are either unassigned to any warp or thread block or released when warps and thread blocks finish. In this paper, we propose two techniques for managing on-chip memory requests that improve L1 cache efficiency when registers and shared memory are leveraged as cache lines. First, we develop a data-transfer policy, triggered when cache lines are recalled by the first register or shared-memory access of a newly launched warp, that prevents data locality from being destroyed. Second, we design a parallel issue scheme that exploits the inherent parallelism among the register-file, shared-memory, and L1 cache requests of an instruction, reducing processing latency and thereby increasing instruction throughput. Experimental results demonstrate that our approach improves performance by 15% over prior work.
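The data-transfer policy summarized above can be illustrated with a minimal software model. This is only a conceptual sketch, not the paper's hardware mechanism: the `L1Cache` class, its method names, and the arbitrary eviction are all hypothetical stand-ins; in real hardware the transfer would be performed by the cache controller when a newly launched warp's first access reclaims the borrowed storage.

```python
# Hypothetical model: registers/shared memory not assigned to any warp or
# thread block are "borrowed" as extra L1 cache lines. When a newly
# launched warp recalls that storage, the policy moves the cached data
# into an ordinary L1 line instead of discarding it, preserving locality.

class L1Cache:
    def __init__(self, capacity):
        self.capacity = capacity  # number of ordinary L1 lines
        self.lines = {}           # tag -> data held in ordinary L1 lines
        self.borrowed = {}        # tag -> data held in borrowed regs/shmem

    def fill_borrowed(self, tag, data):
        """Cache a line in currently unassigned registers/shared memory."""
        self.borrowed[tag] = data

    def recall(self, tag):
        """A new warp reclaims the storage backing a borrowed line.

        Transfer policy: copy the line into an ordinary L1 line (with a
        simple stand-in eviction if full) rather than dropping the data.
        """
        data = self.borrowed.pop(tag)
        if len(self.lines) >= self.capacity:
            self.lines.pop(next(iter(self.lines)))  # arbitrary eviction
        self.lines[tag] = data

    def lookup(self, tag):
        """Return hit/miss status for either ordinary or borrowed lines."""
        if tag in self.lines:
            return ("hit", self.lines[tag])
        if tag in self.borrowed:
            return ("hit", self.borrowed[tag])
        return ("miss", None)


cache = L1Cache(capacity=2)
cache.fill_borrowed(0x40, b"warp0-data")
cache.recall(0x40)            # a newly launched warp reclaims the registers
print(cache.lookup(0x40)[0])  # the data survived the recall: "hit"
```

Without the transfer step, the recall would evict the line outright, and the subsequent lookup would miss; the policy's point is that reclaiming borrowed storage need not destroy the locality it was caching.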
Funding Information
  • Fundamental Research Funds for the Central Universities of Civil Aviation University of China (3122021053)
