Combining loop transformations considering caches and scheduling
- 23 December 2002
- conference paper
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
The performance of modern microprocessors is greatly affected by cache behavior, instruction scheduling, register allocation and loop overhead. High level loop transformations such as fission, fusion, tiling, interchanging and outer loop unrolling (e.g., unroll and jam) are well known to be capable of improving all these aspects of performance. Difficulties arise because these machine characteristics and these optimizations are highly interdependent. Interchanging two loops might, for example, improve cache behavior but make it impossible to allocate registers in the inner loop. Similarly, unrolling or interchanging a loop might individually hurt performance but doing both simultaneously might help performance. Little work has been published on how to combine these transformations into an efficient and effective compiler algorithm. In this paper we present a model that estimates total machine cycle time taking into account cache misses, software pipelining, register pressure and loop overhead. We then develop an algorithm to intelligently search through the various possible transformations, using our machine model to select the set of transformations leading to the best overall performance. We have implemented this algorithm as part of the MIPSPro commercial compiler system. We give experimental results showing that our approach is both effective and efficient in optimizing numerical programs.Keywords
This publication has 10 references indexed in Scilit:
- Combining optimization for cache and instruction-level parallelismPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2002
- Software pipelining showdownPublished by Association for Computing Machinery (ACM) ,1996
- Tile size selection using cache organization and data layoutPublished by Association for Computing Machinery (ACM) ,1995
- Compiler optimizations for improving data localityPublished by Association for Computing Machinery (ACM) ,1994
- Improving the ratio of memory operations to floating-point operations in loopsACM Transactions on Programming Languages and Systems, 1994
- Efficient and exact data dependence analysisPublished by Association for Computing Machinery (ACM) ,1991
- The cache performance and optimizations of blocked algorithmsPublished by Association for Computing Machinery (ACM) ,1991
- Strategies for cache and local memory management by global program transformationJournal of Parallel and Distributed Computing, 1988
- Estimating interlock and improving balance for pipelined architecturesJournal of Parallel and Distributed Computing, 1988
- Software pipelining: an effective scheduling technique for VLIW machinesPublished by Association for Computing Machinery (ACM) ,1988