Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs
- 1 February 2013
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Computer Systems
- Vol. 31 (1), 1-37
- https://doi.org/10.1145/2427631.2427632
Abstract
Reuse Distance (RD) analysis is a powerful memory analysis tool that can potentially help architects study multicore processor scaling. One key obstacle, however, is that multicore RD analysis requires measuring Concurrent Reuse Distance (CRD) and Private-LRU-stack Reuse Distance (PRD) profiles across thread-interleaved memory reference streams. Sensitivity to memory interleaving makes CRD and PRD profiles architecture dependent, preventing them from analyzing different processor configurations. For loop-based parallel programs, CRD and PRD profiles shift coherently across RD values with core count scaling because interleaving threads are symmetric. Simple techniques can predict such shifting, making the analysis of numerous multicore configurations from a small set of CRD and PRD profiles feasible. Given the ubiquity of parallel loops, such techniques will be extremely valuable for studying future large multicore designs. This article investigates using RD analysis to efficiently analyze multicore cache performance for loop-based parallel programs, making several contributions. First, we provide an in-depth analysis of how CRD and PRD profiles change with core count scaling. Second, we develop techniques to predict CRD and PRD profile scaling, in particular employing reference groups [Zhong et al. 2003] to predict coherent shift, demonstrating 90% or greater prediction accuracy. Third, our CRD and PRD profile analyses define two application parameters with architectural implications: C_core is the minimum shared cache capacity that “contains” locality degradation due to core count scaling, and C_share is the capacity at which shared caches begin to provide a cache-miss reduction compared to private caches. And fourth, we apply CRD and PRD profiles to analyze multicore cache performance.
When combined with existing problem scaling prediction, our techniques can predict shared LLC MPKI (private L2 cache MPKI) to within 10.7% (13.9%) of simulation across 1,728 (1,440) configurations using only 36 measured CRD (PRD) profiles.
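To make the CRD/PRD distinction in the abstract concrete, the following is a minimal Python sketch and not the article's implementation. It assumes a toy trace of (thread id, cache block) pairs, a fully associative LRU model, and no coherence effects (write invalidations and replication, which the article's PRD analysis must account for, are ignored here). All function and variable names are hypothetical.

```python
from collections import Counter


def reuse_distances(trace):
    """LRU stack distance of each reference: the number of distinct
    other blocks touched since the previous reference to the same block.
    Cold (first-time) references get None."""
    stack = []  # most recently used block is at the end
    distances = []
    for block in trace:
        if block in stack:
            idx = stack.index(block)
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(block)
    return distances


def crd_profile(interleaved):
    """Concurrent RD: all threads share one LRU stack (shared cache)."""
    return Counter(reuse_distances([block for _, block in interleaved]))


def prd_profile(interleaved):
    """Private-stack RD: one LRU stack per thread (private caches).
    Assumption: no invalidations between the private stacks."""
    per_thread = {}
    for tid, block in interleaved:
        per_thread.setdefault(tid, []).append(block)
    profile = Counter()
    for trace in per_thread.values():
        profile.update(reuse_distances(trace))
    return profile


def misses(profile, capacity):
    """References that miss in an LRU cache holding `capacity` blocks:
    cold references plus those whose distance is >= capacity."""
    return sum(n for d, n in profile.items() if d is None or d >= capacity)


# Two symmetric threads, each running the same loop body on its own data.
interleaved = [(0, "A"), (1, "C"), (0, "B"), (1, "D"), (0, "A"), (1, "C")]
crd = crd_profile(interleaved)  # reuses land at distance 3
prd = prd_profile(interleaved)  # reuses land at distance 1
```

In this toy trace the interleaving of the second thread stretches each reuse from distance 1 (private stacks) to distance 3 (shared stack), which is the kind of profile shift with core count that the abstract describes. Calling `misses(crd, capacity)` or `misses(prd, capacity)` over a range of capacities gives a miss-count curve per profile.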
Funding Information
- Defense Advanced Research Projects Agency (HR0011-10-9-0009)
- Division of Computing and Communication Foundations (CCF-1117042)
This publication has 26 references indexed in Scilit:
- Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? Lecture Notes in Computer Science, 2010
- Program locality analysis using reuse distance. ACM Transactions on Programming Languages and Systems, 2009
- A mechanistic performance model for superscalar out-of-order processors. ACM Transactions on Computer Systems, 2009
- Using Pin as a memory reference generator for multiprocessor simulation. ACM SIGARCH Computer Architecture News, 2005
- Exploring the cache design space for large scale CMPs. ACM SIGARCH Computer Architecture News, 2005
- A NUCA substrate for flexible CMP cache sharing. Published by Association for Computing Machinery (ACM), 2005
- Pin. Published by Association for Computing Machinery (ACM), 2005
- Predicting whole-program locality through reuse distance analysis. Published by Association for Computing Machinery (ACM), 2003
- Analytical cache models with applications to cache partitioning. Published by Association for Computing Machinery (ACM), 2001
- The SPLASH-2 programs. Published by Association for Computing Machinery (ACM), 1995