Divergence-aware warp scheduling

7 December 2013

conference paper
Published by Association for Computing Machinery (ACM) in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-46

pp. 99-110
https://doi.org/10.1145/2540708.2540718

Abstract

This paper uses hardware thread scheduling to improve the performance and energy efficiency of divergent applications on GPUs. We propose Divergence-Aware Warp Scheduling (DAWS), which introduces a divergence-based cache footprint predictor to estimate how much L1 data cache capacity is needed to capture intra-warp locality in loops. Predictor estimates are created from an online characterization of memory divergence and runtime information about the level of control flow divergence in warps. Unlike prior work on Cache-Conscious Wavefront Scheduling, which makes reactive scheduling decisions based on detected cache thrashing, DAWS makes proactive scheduling decisions based on cache usage predictions. DAWS uses these predictions to schedule warps such that data reused by active scalar threads is unlikely to exceed the capacity of the L1 data cache. DAWS attempts to shift the burden of locality management from software to hardware, increasing the performance of simpler and more portable code on the GPU. We compare the execution time of two Sparse Matrix Vector Multiply implementations and show that DAWS runs a simple, divergent version within 4% of a performance-optimized version that has been rewritten to use the on-chip scratchpad and to reduce memory divergence. We show that DAWS achieves a harmonic mean 26% performance improvement over Cache-Conscious Wavefront Scheduling on a diverse selection of highly cache-sensitive applications, with minimal additional hardware.
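The scheduling idea in the abstract can be illustrated with a toy model. The sketch below is not the DAWS hardware mechanism from the paper; it is a simplified illustration under stated assumptions. The cache geometry (`LINE_BYTES`, `L1_LINES`), the `Warp` record, the per-thread cost `lines_per_thread`, and the greedy oldest-first admission policy are all hypothetical choices made for this example. It estimates each warp's cache footprint from how many of its threads are still active in a loop, then admits warps only while the summed estimate fits in the L1 data cache.

```python
from dataclasses import dataclass

# Assumed cache geometry (hypothetical values chosen for illustration only).
LINE_BYTES = 128
L1_LINES = 32 * 1024 // LINE_BYTES  # a 32 KB L1 holds 256 lines of 128 bytes


@dataclass
class Warp:
    wid: int             # warp identifier
    active_threads: int  # threads still executing the loop (control flow divergence)
    in_loop: bool        # whether the warp is currently inside a loop with reuse


def predict_footprint(warp: Warp, lines_per_thread: int) -> int:
    """Estimate the cache lines a warp needs to keep its reused data resident.

    lines_per_thread stands in for an online measure of memory divergence:
    how many distinct cache lines each active thread touches per iteration.
    """
    if not warp.in_loop:
        return 0
    return warp.active_threads * lines_per_thread


def select_schedulable(warps, lines_per_thread, capacity=L1_LINES):
    """Greedily admit warps, oldest first, while the predicted total fits.

    A warp whose footprint does not fit is skipped for now; later warps
    with a smaller footprint may still be admitted.
    """
    admitted = []
    used = 0
    for w in warps:
        need = predict_footprint(w, lines_per_thread)
        if used + need <= capacity:
            admitted.append(w.wid)
            used += need
    return admitted


warps = [
    Warp(0, 32, True),
    Warp(1, 32, True),
    Warp(2, 8, True),
    Warp(3, 32, False),
]
print(select_schedulable(warps, lines_per_thread=4))
```

In this toy run, warps 0 and 1 together fill the assumed 256-line cache, so warp 2 is held back while warp 3, which is outside any loop and has a zero predicted footprint, remains schedulable. The key design point mirrored from the abstract is that the decision is proactive: it depends on a prediction made before thrashing occurs, not on detected misses.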

Funding Information

Natural Sciences and Engineering Research Council of Canada
Nvidia

Cited by 108 articles