2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

Conference Information
Name: 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
Location: Athens, Greece
Date: 2020-10-17 - 2020-10-21

Latest articles from this conference

Akshay Krishna Ramanathan, Gurpreet S Kalsi, Srivatsa Srinivasa, Tarun Makesh Chandran, Kamlesh R Pillai, Om J Omer, Vijaykrishnan Narayanan, Sreenivas Subramoney
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 88-101; https://doi.org/10.1109/micro50266.2020.00020

This paper presents a Look-Up Table (LUT) based Processing-In-Memory (PIM) technique with the potential for running Neural Network inference tasks. We implement a bitline computing free technique to avoid frequent bitline accesses to the cache sub-arrays and thereby considerably reducing the memory access energy overhead. LUT in conjunction with the compute engines enables sub-array level parallelism while executing complex operations through data lookup which otherwise requires multiple cycles. Sub-array level parallelism and systolic input data flow ensure data movement to be confined to the SRAM slice.Our proposed LUT based PIM methodology exploits substantial parallelism using look-up tables, which does not alter the memory structure/organization, that is, preserving the bit-cell and peripherals of the existing SRAM monolithic arrays. Our solution achieves 1.72x higher performance and 3.14x lower energy as compared to a state-of-the-art processing-in-cache solution. Sub-array level design modifications to incorporate LUT along with the compute engines will increase the overall cache area by 5.6%. We achieve 3.97x speedup w.r.t neural network systolic accelerator with a similar area. The re-configurable nature of the compute engines enables various neural network operations and thereby supporting sequential networks (RNNs) and transformer models. Our quantitative analysis demonstrates 101x, 3x faster execution and 91x, 11x energy efficient than CPU and GPU respectively while running the transformer model, BERT-Base.
Gil Shomron, Uri Weiser
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 256-269; https://doi.org/10.1109/micro50266.2020.00032

Deep neural networks (DNNs) are known for their inability to utilize underlying hardware resources due to hard-ware susceptibility to sparse activations and weights. Even in finer granularities, many of the non-zero values hold a portion of zero-valued bits that may cause inefficiencies when executed on hard-ware. Inspired by conventional CPU simultaneous multithreading (SMT) that increases computer resource utilization by sharing them across several threads, we propose non-blocking SMT (NB-SMT) designated for DNN accelerators. Like conventional SMT, NB-SMT shares hardware resources among several execution flows. Yet, unlike SMT, NB-SMT is non-blocking, as it handles structural hazards by exploiting the algorithmic resiliency of DNNs. Instead of opportunistically dispatching instructions while they wait in a reservation station for available hardware, NB-SMT temporarily reduces the computation precision to accommodate all threads at once, enabling a non-blocking operation. We demonstrate NB-SMT applicability using SySMT, an NB-SMT-enabled output-stationary systolic array (OS-SA). Compared with a conventional OS-SA, a 2-threaded SySMT consumes 1.4× the area and delivers 2× speedup with 33% energy savings and less than 1% accuracy degradation of state-of-the-art CNNs with ImageNet. A 4-threaded SySMT consumes 2.5× the area and delivers, for example, 3.4× speedup and 39%×energy savings with 1% accuracy degradation of 40%-pruned ResNet-18.
Liu Liu, Zheng Qu, Lei Deng, Fengbin Tu, Shuangchen Li, Xing Hu, Zhenyu Gu, Yufei Ding, Yuan Xie
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 738-750; https://doi.org/10.1109/micro50266.2020.00066

Deep Neural Networks (DNNs) have been driving the mainstream of Machine Learning applications. However, deploying DNNs on modern hardware with stringent latency requirements and energy constraints is challenging because of the compute-intensive and memory-intensive execution patterns of various DNN models. We propose an algorithm-architecture co-design to boost DNN execution efficiency. Leveraging the noise resilience of nonlinear activation functions in DNNs, we propose dual-module processing that uses approximate modules learned from original DNN layers to compute insensitive activations. Therefore, we can save expensive computations and data accesses of unnecessary sensitive activations. We then design an Executor-Speculator dual-module architecture with support for balance execution and memory access reduction. With acceptable model inference quality degradation, our accelerator design can achieve 2.24x speedup and 1.97x energy efficiency improvement for compute-bound Convolutional Neural Networks (CNNs) and memory-bound Recurrent Neural Networks (RNNs).
Quan M. Nguyen, Daniel Sanchez
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 596-608; https://doi.org/10.1109/micro50266.2020.00056

Applications with irregular memory accesses and control flow, such as graph algorithms and sparse linear algebra, use high-performance cores very poorly and suffer from dismal IPC. Instruction latencies are so large that even SMT cores running multiple data-parallel threads suffer poor utilization.We find that irregular applications have abundant pipeline parallelism that can be used to boost utilization: these applications can be structured as a pipeline of stages decoupled by queues. Queues hide latency very effectively when they allow producer stages to run far ahead of consumers. Prior work has proposed decoupled architectures, such as DAE and streaming multicores, that implement queues in hardware to exploit pipeline parallelism. Unfortunately, prior decoupled architectures are ill-suited to irregular applications, as they lack the control mechanisms needed to achieve decoupling, and target decoupling across cores but suffer from poor utilization within each core due to load imbalance across stages.We present Pipette, a technique that enables cheap pipeline parallelism within each core. Pipette decouples threads within the core using architecturally visible queues. Pipette’s ISA features control mechanisms that allow effective decoupling under irregular control flow. By time-multiplexing stages on the same core, Pipette avoids load imbalance and achieves high core IPC. Pipette’s novel implementation uses the physical register file to implement queues at very low cost, putting otherwise-idle registers to use. Pipette also adds cheap hardware to accelerate common access patterns, enabling fine-grain composition of accelerated accesses and general-purpose computation. As a result, Pipette outperforms data-parallel implementations of several challenging irregular applications by gmean 1.9× (and up to 3.9×).
Brian C. Schwedock, Nathan Beckmann
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 665-680; https://doi.org/10.1109/micro50266.2020.00061

The datacenter introduces new challenges for computer systems around tail latency and security. This paper argues that dynamic NUCA techniques are a better solution to these challenges than prior cache designs. We show that dynamic NUCA designs can meet tail-latency deadlines with much less cache space than prior work, and that they also provide a natural defense against cache attacks. Unfortunately, prior dynamic NUCAs have missed these opportunities because they focus exclusively on reducing data movement.We present Jumanji, a dynamic NUCA technique designed for tail latency and security. We show that prior last-level cache designs are vulnerable to new attacks and offer imperfect performance isolation. Jumanji solves these problems while significantly improving performance of co-running batch applications. Moreover, Jumanji only requires lightweight hardware and a few simple changes to system software, similar to prior D-NUCAs. At 20 cores, Jumanji improves batch weighted speedup by 14% on average, vs. just 2% for a non-NUCA design with weaker security, and is within 2% of an idealized design.
Liang Zhou, Laxmi N. Bhuyan, K. K. Ramakrishnan
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 637-349; https://doi.org/10.1109/micro50266.2020.00059

Saving energy for latency-critical applications like web search can be challenging because of their strict tail latency constraints. State-of-the-art power management frameworks use Dynamic Voltage and Frequency Scaling (DVFS) and Sleep states techniques to slow down the request processing and finish the search just-in-time. However, accurately predicting the compute demand of a request can be difficult. In this paper, we present Gemini, a novel power management framework for latency-critical search engines. Gemini has two unique features to capture the per query service time variation. First, at light loads without request queuing, a two-step DVFS is used to manage the CPU power. Our two-step DVFS selects the initial CPU frequency based on the query specific service time prediction and then judiciously boosts the initial frequency at the right time to catch-up to the deadline. The determination of boosting time further relies on estimating the error in the prediction of individual query’s service time. At high loads, where there is request queuing, only the current request being executed and the critical request in the queue adopt a two-step DVFS. All the other requests in-between use the same frequency to reduce the frequency transition overhead. Second, we develop two separate neural network models, one for predicting the service time and the other for the error in the prediction. The combination of these two predictors significantly improves the power saving and tail latency results of our two-step DVFS. Gemini is implemented on the Solr search engine. Evaluations on three representative query traces show that Gemini saves 41% of the CPU power, and is better than other state-of-the-art techniques.
Yaohua Wang, Lois Orosa, Xiangjun Peng, Yang Guo, Saugata Ghose, Minesh Patel, Jeremie S. Kim, Juan Gomez Luna, Mohammad Sadrosadati, Nika Mansouri Ghiasi, et al.
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 313-328; https://doi.org/10.1109/micro50266.2020.00036

Main memory, composed of DRAM, is a performance bottleneck for many applications, due to the high DRAM access latency. In-DRAM caches work to mitigate this latency by augmenting regular-latency DRAM with small-but-fast regions of DRAM that serve as a cache for the data held in the regular-latency (i.e., slow) region of DRAM. While an effective in-DRAM cache can allow a large fraction of memory requests to be served from a fast DRAM region, the latency savings are often hindered by inefficient mechanisms for migrating (i.e., relocating) copies of data into and out of the fast regions. Existing in-DRAM caches have two sources of inefficiency: (1) their data relocation granularity is an entire multi-kilobyte row of DRAM, even though much of the row may never be accessed due to poor data locality; and (2) because the relocation latency increases with the physical distance between the slow and fast regions, multiple fast regions are physically interleaved among slow regions to reduce the relocation latency, resulting in increased hardware area and manufacturing complexityWe propose a new substrate, FIGARO, that uses existing shared global buffers among subarrays within a DRAM bank to provide support for in-DRAM data relocation across subar-rays at the granularity of a single cache block. FIGARO has a distance-independent latency within a DRAM bank, and avoids complex modifications to DRAM (such as the interleaving of fast and slow regions). Using FIGARO, we design a fine-grained in-DRAM cache called FIGCache. The key idea of FIGCache is to cache only small, frequently-accessed portions of different DRAM rows in a designated region of DRAM. By caching only the parts of each row that are expected to be accessed in the near future, we can pack more of the frequently-accessed data into FIGCache, and can benefit from additional row hits in DRAM (i.e., accesses to an already-open row, which have a lower latency than accesses to an unopened row). FIGCache provides benefits for systems with both heterogeneous DRAM banks (i.e., banks with fast regions and slow regions) and conventional homogeneous DRAM banks (i.e., banks with only slow regions)Our evaluations across a wide variety of applications show that FIGCache improves the average performance of a system using DDR4 DRAM by 16.3% and reduces average DRAM energy consumption by 7.8% for 8-core workloads, over a conventional system without in-DRAM caching. We show that FIGCache outperforms state-of-the-art in-DRAM caching techniques, and that its performance gains are robust across many system and mechanism parameters.
Mahabubul Alam, Abdullah Ash-Saki, Swaroop Ghosh
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 215-228; https://doi.org/10.1109/micro50266.2020.00029

The quantum approximate optimization algorithm (QAOA) is a promising quantum-classical hybrid algorithm to solve hard combinatorial optimization problems. The multi-qubit CPHASE gates used in the quantum circuit for QAOA are commutative i.e., the order of the gates can be altered without changing the output state. This re-ordering leads to the execution of more gates in parallel and a smaller number of additional SWAP gates to compile the QAOA-circuit. Consequently, the circuit-depth and cumulative gate-count become lower which is beneficial for circuit execution time and noise resilience. A less number of gates indicates a lower accumulation of gate-errors, and a reduced circuit-depth means less decoherence time for the qubits. However, finding the best-ordered circuit is a difficult problem and does not scale well with circuit size. This paper presents four generic methodologies to optimize QAOA-circuits by exploiting gate re-ordering. We demonstrate a reduction in gate-count by ≈23.0% and circuit-depth by ≈53.0% on average over a conventional approach without incurring any compilation-time penalty. We also present a variation-aware compilation which enhances the compiled circuit success probability by ≈62.7% for the target hardware over the variation unaware approach. A new metric, Approximation Ratio Gap (ARG), is proposed to validate the quality of the compiled QAOA-circuit instances on actual devices. Hardware implementation of a number of QAOA instances shows ≈25.8% improvement in the proposed metric on average over the conventional approach on ibmq 16 melbourne.
Samira Mirbagher-Ajorpaz, Elba Garza, Gilles Pokam, Daniel A. Jimenez
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 131-145; https://doi.org/10.1109/micro50266.2020.00023

Translation Lookaside Buffers (TLBs) play a critical role in hardware-supported memory virtualization. To speed up address translation and reduce costly page table walks, TLBs cache a small number of recently-used virtual-to-physical address translations. TLBs must make the best use of their limited capacities. Thus, TLB entries with low potential for reuse should be replaced by more useful entries. This paper contributes to an aspect of TLB management that has received little attention in the literature: replacement policy. We show how predictive replacement policies can be tailored toward TLBs to reduce miss rates and improve overall performance.We begin by applying recently proposed predictive cache replacement policies to the TLB. We show these policies do not work well without considering specific TLB behavior. Next, we introduce a novel TLB-focused predictive policy, Control-flow History Reuse Prediction (CHIRP). This policy uses a history signature and replacement algorithm that correlates to known TLB behavior, outperforming other policies.For a 1024-entry 8-way set-associative L2 TLB with a 4KB page size, we show that CHiRP reduces misses per 1000 instructions (MPKI) by an average 28.21% over the least-recently-used (LRU) policy, outperforming Static Re-reference Interval Prediction (SRRIP) [1], Global History Reuse Policy (GHRP) [2] and SHiP [3], which reduce MPKI by an average of 10.36%, 9.03% and 0.88%, respectively.
Jawad Haj-Yahya, Mohammed Alser, Jeremie S. Kim, Lois Orosa, Efraim Rotem, Avi Mendelson, Anupam Chattopadhyay, Onur Mutlu
2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp 1051-1066; https://doi.org/10.1109/micro50266.2020.00088

Modern client processors typically use one of three commonly-used power delivery network (PDN) architectures: 1) motherboard voltage regulators (MBVR), 2) integrated voltage regulators (IVR), and 3) low dropout voltage regulators (LDO). We observe that the energy-efficiency of each of these PDNs varies with the processor power (e.g, thermal design power (TDP) and dynamic power-state) and workload characteristics (e.g., work-load type and computational intensity). This leads to energy-inefficiency and performance loss, as modern client processors operate across a wide spectrum of power consumption and execute a wide variety of workloads.To address this inefficiency, we propose FlexWatts, a hybrid adaptive PDN for modern client processors whose goal is to provide high energy-efficiency across the processor’s wide range of power consumption and workloads. FlexWatts provides high energy-efficiency by intelligently and dynamically allocating PDNs to processor domains depending on the processor’s power consumption and workload. FlexWatts is based on three key ideas. First, FlexWatts combines IVRs and LDOs in a novel way to share multiple on-chip and off-chip resources and thus reduce cost, as well as board and die area overheads. This hybrid PDN is allocated for processor domains with a wide power consumption range (e.g., CPU cores and graphics engines) and it dynamically switches between two modes: IVR-Mode and LDO-Mode, depending on the power consumption. Second, for all other processor domains (that have a low and narrow power range, e.g., the IO domain), FlexWatts statically allocates off-chip VRs, which have high energy-efficiency for low and narrow power ranges. Third, FlexWatts introduces a novel prediction algorithm that automatically switches the hybrid PDN to the mode (IVR-Mode or LDO-Mode) that is the most beneficial based on processor power consumption and workload characteristics.To evaluate the tradeoffs of PDNs, we develop and open-source PDNspot, the first validated architectural PDN model that enables quantitative analysis of PDN metrics. Using PDNspot, we evaluate FlexWatts on a wide variety of SPEC CPU2006, graphics (3DMark06), and battery life (e.g., video playback) workloads against IVR, the state-of-the-art PDN in modern client processors. For a 4 W thermal design power (TDP) processor, FlexWatts improves the average performance of the SPEC CPU2006 and 3DMark06 workloads by 22% and 25%, respectively. For battery life workloads, FlexWatts reduces the average power consumption of video playback by 11% across all tested TDPs (4W–50W). FlexWatts has comparable cost and area overhead to IVR. We conclude that FlexWatts provides high energy-efficiency across a modern client processor’s wide range of power consumption and wide variety of workloads, with minimal overhead.
Back to Top Top