Data and program restructuring of irregular applications for cache-coherent multiprocessor

Abstract
Applications with irregular data structures such as sparse matrices or finite element meshes account for a large fraction of engineering and scientific applications. Domain decomposition techniques are commonly used to partition these applications to reduce interprocessor communication on message passing parallel systems. Our work investigates the use of domain decomposition techniques on cache-coherent parallel systems.Many good domain decomposition algorithms are now available. We show that further application improvements are attainable using data and program restructuring in conjunction with domain decomposition. We give techniques for data layout to reduce communication, blocking with subdomains to improve uniprocessor cache behavior, and insertion of prefetches to hide the latency of interprocessor communication.This paper details our restructuring techniques and provides experimental results on the KSR1 multiprocessor for a sparse matrix application. The experimental results include counts of cache misses provided by the KSR PMON performance monitoring tool. Our data show that cache coherency traffic can be reduced by 30%–60% using our data layout scheme and that more than 53% of the remaining coherency cache misses can be eliminated using prefetch instructions.