2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)

Conference Information
Name: 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
Location: Valencia, Spain
Date: 2020-05-30 – 2020-06-03

Latest articles from this conference

Dimitrios Skarlatos, Umur Darbaz, Bhargava Gopireddy, Nam Sung Kim, Josep Torrellas
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) pp 501-514; https://doi.org/10.1109/isca45697.2020.00049

Cloud computing has begun a transformation from virtual machines to containers. Containers are attractive because many of them can share a single kernel while adding minimal performance overhead. Cloud providers leverage the lean nature of containers to run hundreds of them on a few cores. Furthermore, containers enable the serverless paradigm, which leads to the creation of short-lived processes. In this work, we identify that containerized environments create page translations that are extensively replicated across containers in the TLB and in page tables. The result is high TLB pressure and redundant kernel work during page table management. To remedy this situation, this paper proposes BabelFish, a novel architecture to share page translations across containers in the TLB and in page tables. We evaluate BabelFish with simulations of an 8-core processor running a set of Docker containers in an environment with conservative container co-location. On average, under BabelFish, 53% of the translations in containerized workloads and 93% of the translations in serverless workloads are shared. As a result, BabelFish reduces the mean and tail latency of containerized data-serving workloads by 11% and 18%, respectively. It also lowers the execution time of containerized compute workloads by 11%. Finally, it reduces serverless function bring-up time by 8% and execution time by 10%–55%.
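The sharing idea in the abstract can be sketched in miniature: if TLB entries for identical mappings are tagged with a group ID shared by related containers rather than a per-process address-space ID, one entry can serve every container in the group. This is only an illustrative model; the class and field names (ASID, CCID) are hypothetical, not BabelFish's actual structures.

```python
# Toy model of translation sharing across containers: shared mappings are
# keyed by a hypothetical "container community" ID (CCID) instead of the
# per-process address-space ID (ASID), so replicas collapse to one entry.

class SharedTLB:
    def __init__(self):
        self.entries = {}  # (key, vpn) -> pfn

    def insert(self, asid, ccid, vpn, pfn, shared):
        # Shared translations (e.g. common runtime/library pages) use the
        # community ID; private ones fall back to the per-process ASID.
        key = ccid if shared else asid
        self.entries[(key, vpn)] = pfn

    def lookup(self, asid, ccid, vpn):
        # Probe the private entry first, then the community-wide entry.
        return self.entries.get((asid, vpn)) or self.entries.get((ccid, vpn))

tlb = SharedTLB()
# Container 1 (community 7) maps a shared runtime page once...
tlb.insert(asid=1, ccid=7, vpn=0x40, pfn=0x900, shared=True)
# ...and container 2 in the same community hits without its own copy.
assert tlb.lookup(asid=2, ccid=7, vpn=0x40) == 0x900
assert len(tlb.entries) == 1
```

One entry now stands in for what would otherwise be one replicated translation per container, which is the source of the TLB-pressure reduction the abstract quantifies.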
Yifan Yang, Zhaoshi Li, Yangdong Deng, Zhiwei Liu, Shouyi Yin, Shaojun Wei, Leibo Liu
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) pp 419-432; https://doi.org/10.1109/isca45697.2020.00043

It is of vital importance to efficiently process large graphs for many data-intensive applications. As a result, a large collection of graph analytics frameworks has been proposed to improve per-iteration performance on a single kind of computation resource. However, heavy coordination and synchronization overhead make it hard to scale graph analytics frameworks out from a single platform to heterogeneous platforms. Furthermore, increasing the convergence rate, i.e., reducing the number of iterations, which is equally vital for improving the overall performance of iterative graph algorithms, receives much less attention. In this paper, we introduce the Block Coordinate Descent (BCD) view of graph algorithms and propose GraphABCD, an asynchronous heterogeneous graph analytics framework built on this view. The BCD view offers key insights and trade-offs for achieving a high convergence rate in iterative graph algorithms, and GraphABCD converges quickly under the algorithm design options BCD suggests. GraphABCD offers algorithmic and architectural support for asynchronous execution without undermining its fast convergence properties. With minimal synchronization overhead, GraphABCD scales out to heterogeneous and distributed accelerators efficiently. To demonstrate GraphABCD, we prototype the whole system on the Intel HARPv2 CPU-FPGA heterogeneous platform. Evaluations on HARPv2 show that GraphABCD achieves geometric-mean speedups over GraphMat, a state-of-the-art framework, of 4.8x in convergence rate and 2.0x in execution time.
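The BCD view can be illustrated with PageRank on a toy graph: vertices are partitioned into blocks, and each block update reads the latest ranks of all other vertices (Gauss-Seidel style) rather than a stale snapshot, which is what lets asynchronous execution converge in fewer sweeps than bulk-synchronous execution. The graph, block partition, and sweep count below are illustrative, not GraphABCD's actual configuration.

```python
# Block Coordinate Descent view of an iterative graph algorithm, sketched
# with PageRank: update one block of vertices at a time against the
# freshest values of the rest (no global synchronization barrier needed).

def pagerank_bcd(edges, n, blocks, d=0.85, sweeps=100):
    out_deg = [0] * n
    for u, v in edges:
        out_deg[u] += 1
    rank = [1.0 / n] * n
    for _ in range(sweeps):
        for block in blocks:              # one "coordinate block" at a time
            for v in block:
                incoming = sum(rank[u] / out_deg[u]
                               for u, w in edges if w == v)
                rank[v] = (1 - d) / n + d * incoming  # uses freshest ranks
    return rank

edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
r = pagerank_bcd(edges, n=3, blocks=[[0, 1], [2]])
assert abs(sum(r) - 1.0) < 1e-4   # ranks converge to a distribution
assert max(r) == r[1]             # vertex 1 has the most incoming rank
```

Because each block update only touches its own vertices' ranks, different blocks can be assigned to different accelerators with little coordination, which is the scaling property the abstract emphasizes.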
Jian Zhou, Amro Awad, Jun Wang
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) pp 597-609; https://doi.org/10.1109/isca45697.2020.00056

Bulk operations, such as Copy-on-Write (CoW), have been heavily used in most operating systems. In particular, CoW brings significant savings in memory space and improvements in performance. CoW relies mainly on the fact that many allocated virtual pages are not written immediately (if ever). Thus, assigning them to a shared physical page eliminates much of the copy/initialization overhead in addition to improving memory space efficiency. By prohibiting writes to the shared page, and merely copying the page content to a new physical page at the first write, CoW achieves significant performance and memory space advantages. Unfortunately, with the limited write bandwidth and slow writes of emerging Non-Volatile Memories (NVMs), such bulk writes can throttle the memory system. Moreover, they can add significant delays to the first write access to each page due to the need to copy or initialize a new page. Ideally, we want to enable CoW at fine granularity, so that only the updated cache blocks within a page need to be copied. To this end, we propose Lelantus, a novel approach that leverages secure memory metadata to allow fine-granularity CoW operations. Lelantus relies on a novel hardware-software co-design that tracks the updated blocks of copied pages and hence delays the copy of the remaining blocks until they are written. The impact of Lelantus becomes more significant when huge pages, e.g., 2MB or 1GB, are deployed, as expected with emerging NVMs.
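The fine-granularity idea can be sketched abstractly: a per-page bitmap records which cache blocks of a copied page have actually been written; unwritten blocks are still served from the shared source page, so the first write copies one block instead of the whole page. This is a behavioral model only; Lelantus implements the tracking in secure-memory metadata and hardware, and the names below are hypothetical.

```python
# Behavioral sketch of fine-granularity copy-on-write: only the blocks
# that are actually written get private copies; reads of untouched blocks
# fall through to the shared source page.

BLOCKS_PER_PAGE = 64  # e.g. 64-byte blocks in a 4 KB page

class CowPage:
    def __init__(self, source):
        self.source = source                     # shared original page
        self.blocks = [None] * BLOCKS_PER_PAGE   # privately copied blocks
        self.written = [False] * BLOCKS_PER_PAGE # per-block dirty bitmap

    def write(self, idx, data):
        # Copy/overwrite just the one touched block, not the whole page.
        self.blocks[idx] = data
        self.written[idx] = True

    def read(self, idx):
        return self.blocks[idx] if self.written[idx] else self.source[idx]

shared = ["orig"] * BLOCKS_PER_PAGE
copy = CowPage(shared)
copy.write(3, "new")
assert copy.read(3) == "new" and copy.read(4) == "orig"
assert sum(copy.written) == 1   # only one block was ever copied
```

With a 2MB huge page and 64-byte blocks, a whole-page first-write copy moves 32768 blocks; in this model a sparse writer moves only the blocks it touches, which is why the abstract expects larger gains at huge-page granularity.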
Alexey Lavrov, David Wentzlaff
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) pp 487-500; https://doi.org/10.1109/isca45697.2020.00048

Hardware resource sharing has proven to be an efficient way to increase resource utilization, save energy, and decrease operational cost. Modern-day servers accommodate hundreds of Virtual Machines (VMs) running concurrently, and lightweight software abstractions like containers enable the consolidation of an even larger number of independent tenants per server. The increasing number of hardware accelerators along with growing interconnection bandwidth creates a new class of devices available for sharing. To fully utilize the potential of these devices, the I/O architecture needs to be carefully designed for both processors and devices. This paper presents the design and analysis of scalable Hyper-tenant TRanslation of I/O addresses (HyperTRIO) for shared devices. HyperTRIO provides isolation and performance guarantees at low hardware cost by supporting multiple in-flight address translations, partitioning translation caches, and utilizing both inter- and intra-tenant access patterns for translation prefetching. This work also constructs a Hyper-tenant Simulator of I/O address accesses (HyperSIO) for 1000-tenant systems, which we have open-sourced. This work characterizes tenant access patterns and uses these insights to address the identified challenges. Overall, the HyperTRIO design enables the system to utilize the full available I/O bandwidth in a hyper-tenant environment.
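One of the listed mechanisms, partitioned translation caches, can be sketched simply: giving each tenant its own eviction domain means a tenant that floods the I/O TLB can only evict its own entries. The quota policy and names below are illustrative assumptions, not HyperTRIO's actual partitioning scheme.

```python
# Sketch of a per-tenant partitioned I/O translation cache: each tenant
# gets an LRU partition with a fixed quota, so evictions never cross
# tenant boundaries (the isolation property the paper targets).

from collections import OrderedDict

class PartitionedIOTLB:
    def __init__(self, quota_per_tenant):
        self.quota = quota_per_tenant
        self.parts = {}  # tenant -> LRU-ordered {iova: physical address}

    def insert(self, tenant, iova, pa):
        part = self.parts.setdefault(tenant, OrderedDict())
        if len(part) >= self.quota:
            part.popitem(last=False)  # evict LRU, but only within this tenant
        part[iova] = pa

    def lookup(self, tenant, iova):
        return self.parts.get(tenant, {}).get(iova)

tlb = PartitionedIOTLB(quota_per_tenant=2)
tlb.insert("A", 0x10, 0x100)
for i in range(8):                      # tenant B floods the cache...
    tlb.insert("B", i, i)
assert tlb.lookup("A", 0x10) == 0x100   # ...but A's entry survives
```

In a shared (unpartitioned) cache of the same total size, tenant B's eight inserts would have evicted A's entry, costing A a page-table walk on its next I/O access.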
Weitao Li, Pengfei Xu, Yang Zhao, Haitong Li, Yuan Xie, Yingyan Lin
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) pp 832-845; https://doi.org/10.1109/isca45697.2020.00073

Resistive-random-access-memory (ReRAM) based processing-in-memory (R²PIM) accelerators show promise in bridging the gap between Internet of Things devices' constrained resources and Convolutional/Deep Neural Networks' (CNNs/DNNs') prohibitive energy cost. Specifically, R²PIM accelerators enhance energy efficiency by eliminating the cost of weight movement and improving computational density through ReRAM's high density. However, energy efficiency is still limited by the dominant cost of input and partial-sum (Psum) movement and the cost of the digital-to-analog (D/A) and analog-to-digital (A/D) interfaces. In this work, we identify three energy-saving opportunities in R²PIM accelerators: analog data locality, time-domain interfacing, and input access reduction, and propose an innovative R²PIM accelerator called TIMELY, with three key contributions: (1) TIMELY adopts analog local buffers (ALBs) within ReRAM crossbars to greatly enhance data locality, minimizing the energy overheads of both input and Psum movement; (2) TIMELY largely reduces the energy of each D/A (and A/D) conversion and the total number of conversions by using time-domain interfaces (TDIs) and the employed ALBs, respectively; (3) we develop an only-once input read (O²IR) mapping method to further decrease the energy of input accesses and the number of D/A conversions. Evaluation with more than 10 CNN/DNN models and various chip configurations shows that TIMELY outperforms the baseline R²PIM accelerator, PRIME, by one order of magnitude in energy efficiency while maintaining better computational density (up to 31.2×) and throughput (up to 736.6×). Furthermore, comprehensive studies evaluate the effectiveness of the proposed ALB, TDI, and O²IR in terms of energy savings and area reduction.
Ben Feinberg, Benjamin C. Heyman, Darya Mikhailenko, Ryan Wong, An C. Ho, Engin Ipek
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) pp 1076-1088; https://doi.org/10.1109/isca45697.2020.00091

Data movement is a significant and growing consumer of energy in modern systems, from specialized low-power accelerators to GPUs with power budgets in the hundreds of watts. Given the importance of the problem, prior work has proposed designing interconnects on which the energy cost of transmitting a 0 is significantly lower than that of transmitting a 1. With such an interconnect, data movement energy is reduced by encoding the transmitted data so that the number of 1s is minimized. Although promising, these data encoding proposals do not take full advantage of application-level semantics. As an example of a neglected optimization opportunity, consider a dot product computed as part of a neural network inference task. The order in which the neural network weights are fetched and processed does not affect correctness, and can be optimized to further reduce data movement energy. This paper presents commutative data reordering (CDR), a hardware-software approach that leverages the commutative property of linear algebra to strategically select the order in which weight matrix coefficients are fetched from memory. To find a low-energy transmission order, weight ordering is modeled as an instance of one of two well-studied problems, the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. This reduction makes it possible to leverage the vast body of work on efficient approximation methods to find a good transmission order. CDR exploits the indirection inherent to sparse matrix formats such that no additional metadata is required to specify the selected order. The hardware modifications required to support CDR are minimal, incurring an area penalty of less than 0.01% when implemented on top of a mobile-class GPU.
When applied to 7 neural network inference tasks running on a GPU-based system, CDR reduces average DRAM I/O energy by 53.1% over the data bus inversion encoding used by LPDDR4 and by 22.2% over the recently proposed Base + XOR encoding. These savings are attained with no changes to the mobile system software and no runtime performance penalty.
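The TSP reduction can be illustrated with a toy: treat each weight word as a "city" and the Hamming distance between consecutive words as the tour cost, then apply any TSP approximation. The greedy nearest-neighbor heuristic and the 4-bit values below are illustrative, not the approximation method or encoding model the paper actually evaluates.

```python
# Toy instance of commutative reordering: pick a transmission order for
# weight words that lowers the total bit-difference between consecutive
# words, modeled as a Traveling-Salesman-style tour over Hamming distance.

def hamming(a, b):
    return bin(a ^ b).count("1")

def reorder(words):
    # Greedy nearest-neighbor TSP heuristic, starting from the first word.
    remaining = list(words)
    tour = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda w: hamming(tour[-1], w))
        remaining.remove(nxt)
        tour.append(nxt)
    return tour

def transitions(order):
    return sum(hamming(a, b) for a, b in zip(order, order[1:]))

weights = [0b1111, 0b0000, 0b1110, 0b0001]
ordered = reorder(weights)
assert sorted(ordered) == sorted(weights)            # same multiset: dot
                                                     # product is unchanged
assert transitions(ordered) <= transitions(weights)  # cheaper to transmit
```

Here the original order costs 11 bit transitions and the reordered one costs 5; because addition is commutative, the dot product accumulates to the same value in either order, so the reordering is free of correctness constraints.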
Adarsh Chauhan, Jayesh Gaur, Zeev Sperber, Franck Sala, Lihu Rappoport, Adi Yoaz, Sreenivas Subramoney
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) pp 92-104; https://doi.org/10.1109/isca45697.2020.00019

Advancements in branch predictors have allowed modern processors to speculate aggressively and gain significant performance with every generation of increasing out-of-order depth and width. Unfortunately, some branches are still hard to predict (H2P), and mis-speculation on these branches severely limits the performance scalability of future processors. One potential solution is to predicate such branches, substituting control dependencies with data dependencies. Predication is very costly for performance, however, as it inhibits instruction-level parallelism. To overcome this limitation, prior works selectively applied predication at run-time to H2P branches with low branch prediction confidence. However, these schemes do not fully comprehend the delicate trade-offs involved in suppressing speculation and can suffer performance degradation on certain workloads. Additionally, they need significant changes not just to the hardware but also to the compiler and the instruction set architecture, rendering their implementation complex and challenging. In this paper, by analyzing the fundamental trade-offs between branch prediction and predication, we propose Auto-Predication of Critical Branches (ACB), an end-to-end hardware-based solution that intelligently disables speculation only on branches that are critical for performance. Unlike existing approaches, ACB uses a sophisticated performance monitoring mechanism to gauge the effectiveness of dynamic predication and hence does not suffer from performance inversions. Our simulation results show that, with just 386 bytes of additional hardware and no software support, ACB delivers an 8% performance gain over a baseline similar to the Skylake processor. We also show that ACB reduces pipeline flushes caused by mis-speculation by 22%, effectively helping both power and performance.
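The control-to-data-dependency substitution at the heart of predication can be shown in source form: both sides of the branch execute unconditionally, and a select (conditional move) picks the result. This is only a sketch of the general technique; ACB itself applies it dynamically in hardware, not at the source level.

```python
# Predication in miniature: the branchy version has a control dependency
# (a misprediction flushes the pipeline), while the predicated version
# executes both paths and selects, leaving only a data dependency, at the
# cost of the extra work on the not-taken path.

def branchy(cond, a, b):
    if cond:                     # control dependency
        return a + 1
    return b - 1

def predicated(cond, a, b):
    t, f = a + 1, b - 1          # both paths execute unconditionally
    return t if cond else f      # select: data dependency only

# Both forms compute the same result for every outcome of the branch.
assert all(branchy(c, 5, 9) == predicated(c, 5, 9) for c in (True, False))
```

The trade-off the paper analyzes falls directly out of this shape: predication wastes the not-taken path's work and serializes on the select, so it only pays off on branches whose mispredictions are frequent and expensive, which is why ACB restricts it to performance-critical H2P branches.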
Joohyeong Yoon, Won Seob Jeong, Won Woo Ro
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) pp 693-706; https://doi.org/10.1109/isca45697.2020.00063

Persistent key-value stores support journaling and checkpointing to maintain data consistency and prevent data loss. However, conventional data consistency mechanisms are not well suited to efficient management of the flash memory in SSDs because they write the same data twice, inducing redundant flash operations. As a result, query processing is delayed by heavy traffic during checkpointing. Checkpointing involves many write operations by nature, and a write operation costs substantial time and energy on an SSD; worse, it can aggravate the write amplification problem and shorten the lifetime of the flash memory. In this paper, we propose an in-storage checkpointing mechanism, named Check-In, based on cooperation between the storage engine of a host and the flash translation layer (FTL) of an SSD. Compared to the existing mechanism, our proposed mechanism reduces checkpointing-induced tail latency by 92.1% and the number of duplicate writes by 94.3%. Overall, average throughput and latency are improved by 8.1% and 10.2%, respectively.
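The host/FTL cooperation can be sketched as a mapping-table trick: since the journaled data already sits in flash, a checkpoint can become a logical-to-physical remap inside the FTL rather than a second physical write. The interface and names below are hypothetical illustrations of this general idea, not Check-In's actual protocol.

```python
# Sketch of duplicate-write elimination via FTL cooperation: the host
# journals data once, then "checkpoints" by asking the FTL to point the
# checkpoint's logical page at the already-written physical page.

class FTL:
    def __init__(self):
        self.l2p = {}          # logical page -> physical flash page
        self.flash_writes = 0  # actual program operations issued to flash

    def write(self, lpn, ppn):
        self.l2p[lpn] = ppn
        self.flash_writes += 1

    def remap(self, src_lpn, dst_lpn):
        # Checkpoint by mapping-table update only: zero flash writes.
        self.l2p[dst_lpn] = self.l2p[src_lpn]

ftl = FTL()
ftl.write(lpn="journal:0", ppn=42)   # data hits flash once, in the journal
ftl.remap("journal:0", "db:0")       # checkpoint = metadata update
assert ftl.l2p["db:0"] == 42 and ftl.flash_writes == 1
```

In the conventional scheme both operations would be flash writes; collapsing the second into a metadata update is what removes the duplicate-write traffic and, with it, the checkpointing tail-latency spike.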