Pipette: Improving Core Utilization on Irregular Applications through Intra-Core Pipeline Parallelism
- 1 October 2020
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
Abstract
Applications with irregular memory accesses and control flow, such as graph algorithms and sparse linear algebra, use high-performance cores very poorly and suffer from dismal IPC. Instruction latencies are so large that even SMT cores running multiple data-parallel threads suffer poor utilization.

We find that irregular applications have abundant pipeline parallelism that can be used to boost utilization: these applications can be structured as a pipeline of stages decoupled by queues. Queues hide latency very effectively when they allow producer stages to run far ahead of consumers. Prior work has proposed decoupled architectures, such as DAE and streaming multicores, that implement queues in hardware to exploit pipeline parallelism. Unfortunately, prior decoupled architectures are ill-suited to irregular applications, as they lack the control mechanisms needed to achieve decoupling, and target decoupling across cores but suffer from poor utilization within each core due to load imbalance across stages.

We present Pipette, a technique that enables cheap pipeline parallelism within each core. Pipette decouples threads within the core using architecturally visible queues. Pipette’s ISA features control mechanisms that allow effective decoupling under irregular control flow. By time-multiplexing stages on the same core, Pipette avoids load imbalance and achieves high core IPC. Pipette’s novel implementation uses the physical register file to implement queues at very low cost, putting otherwise-idle registers to use. Pipette also adds cheap hardware to accelerate common access patterns, enabling fine-grain composition of accelerated accesses and general-purpose computation. As a result, Pipette outperforms data-parallel implementations of several challenging irregular applications by gmean 1.9× (and up to 3.9×).
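The idea of structuring an irregular computation as stages decoupled by queues can be illustrated in software. The following is a minimal sketch, not Pipette's ISA or implementation: it uses ordinary Python threads and bounded `queue.Queue` objects to split a sparse neighbor-sum over a CSR-style graph into three stages. The function name `run_pipeline`, the markers `DONE` and `END_OF_ROW`, and the `queue_depth` parameter are all hypothetical names chosen for this example. Python threads will not reproduce the latency hiding that hardware queues provide; the sketch only shows the program structure.

```python
import threading
from queue import Queue

# Hypothetical control markers for this sketch (not part of Pipette's ISA).
DONE = object()        # end of the whole stream
END_OF_ROW = object()  # boundary between one vertex's neighbors and the next


def run_pipeline(offsets, neighbors, values, queue_depth=8):
    """Sum values[n] over each vertex's neighbors using three decoupled stages.

    offsets/neighbors form a CSR-style adjacency structure: the neighbors of
    vertex v are neighbors[offsets[v]:offsets[v + 1]].
    """
    ranges_q = Queue(maxsize=queue_depth)  # stage 1 -> stage 2
    ids_q = Queue(maxsize=queue_depth)     # stage 2 -> stage 3
    sums = []

    def stream_ranges():
        # Stage 1: walk the offsets array and emit each vertex's edge range.
        for v in range(len(offsets) - 1):
            ranges_q.put((offsets[v], offsets[v + 1]))
        ranges_q.put(DONE)

    def fetch_neighbors():
        # Stage 2: expand each edge range into a stream of neighbor ids.
        while (item := ranges_q.get()) is not DONE:
            start, end = item
            for e in range(start, end):
                ids_q.put(neighbors[e])
            ids_q.put(END_OF_ROW)  # tell the consumer this vertex is finished
        ids_q.put(DONE)

    def accumulate():
        # Stage 3: perform the indirect access values[n] and reduce per vertex.
        total = 0
        while (item := ids_q.get()) is not DONE:
            if item is END_OF_ROW:
                sums.append(total)
                total = 0
            else:
                total += values[item]

    stages = [threading.Thread(target=f)
              for f in (stream_ranges, fetch_neighbors, accumulate)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return sums


if __name__ == "__main__":
    # Vertex 0 -> {1, 2}, vertex 1 -> {}, vertex 2 -> {0, 1, 2}
    print(run_pipeline([0, 2, 2, 5], [1, 2, 0, 1, 2], [10, 20, 30]))
```

The bounded queues let a producer stage run ahead of its consumer by up to `queue_depth` items, and the in-band `END_OF_ROW` marker is one simple way to carry variable-length, data-dependent control flow between stages in this sketch.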