Pipette: Improving Core Utilization on Irregular Applications through Intra-Core Pipeline Parallelism

Abstract
Applications with irregular memory accesses and control flow, such as graph algorithms and sparse linear algebra, use high-performance cores very poorly and suffer from dismal IPC. Instruction latencies are so large that even SMT cores running multiple data-parallel threads suffer poor utilization.We find that irregular applications have abundant pipeline parallelism that can be used to boost utilization: these applications can be structured as a pipeline of stages decoupled by queues. Queues hide latency very effectively when they allow producer stages to run far ahead of consumers. Prior work has proposed decoupled architectures, such as DAE and streaming multicores, that implement queues in hardware to exploit pipeline parallelism. Unfortunately, prior decoupled architectures are ill-suited to irregular applications, as they lack the control mechanisms needed to achieve decoupling, and target decoupling across cores but suffer from poor utilization within each core due to load imbalance across stages.We present Pipette, a technique that enables cheap pipeline parallelism within each core. Pipette decouples threads within the core using architecturally visible queues. Pipette’s ISA features control mechanisms that allow effective decoupling under irregular control flow. By time-multiplexing stages on the same core, Pipette avoids load imbalance and achieves high core IPC. Pipette’s novel implementation uses the physical register file to implement queues at very low cost, putting otherwise-idle registers to use. Pipette also adds cheap hardware to accelerate common access patterns, enabling fine-grain composition of accelerated accesses and general-purpose computation. As a result, Pipette outperforms data-parallel implementations of several challenging irregular applications by gmean 1.9× (and up to 3.9×).

This publication has 52 references indexed in Scilit: