Pipette: Improving Core Utilization on Irregular Applications through Intra-Core Pipeline Parallelism

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

p. 596-608
https://doi.org/10.1109/micro50266.2020.00056

Abstract

Applications with irregular memory accesses and control flow, such as graph algorithms and sparse linear algebra, use high-performance cores very poorly and suffer from dismal IPC. Instruction latencies are so large that even SMT cores running multiple data-parallel threads suffer poor utilization.

We find that irregular applications have abundant pipeline parallelism that can be used to boost utilization: these applications can be structured as a pipeline of stages decoupled by queues. Queues hide latency very effectively when they allow producer stages to run far ahead of consumers. Prior work has proposed decoupled architectures, such as DAE and streaming multicores, that implement queues in hardware to exploit pipeline parallelism. Unfortunately, prior decoupled architectures are ill-suited to irregular applications, as they lack the control mechanisms needed to achieve decoupling, and target decoupling across cores but suffer from poor utilization within each core due to load imbalance across stages.

We present Pipette, a technique that enables cheap pipeline parallelism within each core. Pipette decouples threads within the core using architecturally visible queues. Pipette’s ISA features control mechanisms that allow effective decoupling under irregular control flow. By time-multiplexing stages on the same core, Pipette avoids load imbalance and achieves high core IPC. Pipette’s novel implementation uses the physical register file to implement queues at very low cost, putting otherwise-idle registers to use. Pipette also adds cheap hardware to accelerate common access patterns, enabling fine-grain composition of accelerated accesses and general-purpose computation. As a result, Pipette outperforms data-parallel implementations of several challenging irregular applications by gmean 1.9× (and up to 3.9×).
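
The idea of structuring an irregular kernel as a pipeline of stages decoupled by queues can be illustrated with a rough software analogy. The sketch below is not Pipette's ISA or hardware: it assumes ordinary Python threads and bounded `queue.Queue` objects, and the names `run_pipeline`, `DONE`, and `depth` are hypothetical, chosen only for this illustration. It splits an indirect sum over `values[indices[k]]` into three stages, so that the index-streaming stage can run ahead of the stage that performs the irregular loads.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker passed down the pipeline


def run_pipeline(indices, values, depth=16):
    """Sum values[i] for each i in indices using three queue-decoupled stages."""
    q_idx = queue.Queue(maxsize=depth)  # stage 1 -> stage 2
    q_val = queue.Queue(maxsize=depth)  # stage 2 -> stage 3
    result = []

    def fetch_indices():
        # Stage 1: stream indices; may run up to `depth` items ahead.
        for i in indices:
            q_idx.put(i)
        q_idx.put(DONE)

    def gather():
        # Stage 2: perform the irregular (indirect) access.
        while True:
            i = q_idx.get()
            if i is DONE:
                q_val.put(DONE)  # forward the end-of-stream marker
                return
            q_val.put(values[i])

    def accumulate():
        # Stage 3: consume loaded values and reduce them.
        total = 0
        while True:
            v = q_val.get()
            if v is DONE:
                result.append(total)
                return
            total += v

    stages = [threading.Thread(target=f)
              for f in (fetch_indices, gather, accumulate)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return result[0]
```

The bounded queue depth stands in for the limited queue capacity a hardware implementation would have: a producer blocks once it is `depth` items ahead of its consumer. The end-of-stream marker is one simple way to propagate a control event through the stages; the control mechanisms in Pipette's ISA are not described in the abstract and may differ.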


Cited by 13 articles