ACM Transactions on Architecture and Code Optimization

Journal Information
ISSN / EISSN : 1544-3566 / 1544-3973
Total articles ≅ 812
Current Coverage
SCOPUS
EI COMPENDEX
SCIE
INSPEC
Archived in
SHERPA/ROMEO

Latest articles in this journal

Xueying Wang, Xiaobing Chen, Xiao Dong, Xianzhi Yu, Yongxin Yang, Wei Cao, Lei Liu, et al.
ACM Transactions on Architecture and Code Optimization; https://doi.org/10.1145/3535355

Abstract:
Deep Neural Networks (DNNs) tend to go deeper and wider, which poses a significant challenge to the training of DNNs due to the limited memory capacity of DNN accelerators. Existing solutions for memory-efficient DNN training are tightly coupled with the application features of DNN workloads; for example, they require the layer structures or computational graphs of the DNNs. This results in weak versatility for DNNs with sophisticated layer structures or complicated computation graphs: such schemes usually need to be re-implemented or re-adapted to handle the new layer structures or unusual operators these DNNs introduce. In this paper, we review the memory pressure issues of DNN training from the perspective of runtime systems and model the memory access behaviors of DNN workloads. We identify the iteration, regularity, and extremalization properties of the memory access patterns of DNN workloads. Based on these observations, we propose AppObMem, an application-oblivious memory scheduling system. AppObMem automatically traces the memory behaviors of DNN workloads and schedules memory swapping to reduce the memory pressure on device accelerators, without requiring high-level information about layer structures or computation graphs. Evaluations on a variety of DNN models show that AppObMem obtains 40%-60% memory savings with acceptable performance loss. AppObMem is also competitive with other open-source state-of-the-art schemes.
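
The abstract's key idea is that DNN iterations repeat the same memory access pattern, so swap decisions can be made from a trace alone. The sketch below illustrates that idea under stated assumptions; all names (MemTrace, pick_swap_candidates, gap_threshold) are hypothetical and this is not AppObMem's actual algorithm.

```python
# Minimal sketch: choose swap-out candidates purely from traced access times,
# with no knowledge of layer structures or the computation graph.
from collections import defaultdict

class MemTrace:
    """Records, per allocation, the logical timesteps at which it is accessed."""
    def __init__(self):
        self.accesses = defaultdict(list)   # alloc_id -> [timestep, ...]
        self.sizes = {}                     # alloc_id -> bytes

    def record(self, alloc_id, size, timestep):
        self.sizes[alloc_id] = size
        self.accesses[alloc_id].append(timestep)

def pick_swap_candidates(trace, gap_threshold, budget_bytes):
    """An allocation whose consecutive accesses are far apart within one iteration
    can be swapped to host memory after its first use and prefetched before the next."""
    candidates = []
    for alloc_id, times in trace.accesses.items():
        gaps = [b - a for a, b in zip(times, times[1:])]
        if gaps and max(gaps) >= gap_threshold:
            candidates.append((max(gaps), alloc_id))
    candidates.sort(reverse=True)           # prefer the largest idle gaps
    plan, freed = [], 0
    for gap, alloc_id in candidates:
        if freed >= budget_bytes:
            break
        plan.append(alloc_id)
        freed += trace.sizes[alloc_id]
    return plan

# Tiny usage example: "act0" is idle between step 0 and step 2, so it is swappable.
trace = MemTrace()
for step, (alloc, size) in enumerate([("act0", 512), ("act1", 256), ("act0", 512)]):
    trace.record(alloc, size, step)
print(pick_swap_candidates(trace, gap_threshold=2, budget_bytes=512))  # ['act0']
```
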
M. Ben Olson, Brandon Kammerdiener, Michael R. Jantz, Kshitij A. Doshi, Terry Jones
ACM Transactions on Architecture and Code Optimization; https://doi.org/10.1145/3533855

Abstract:
As scaling of conventional memory devices has stalled, many high-end computing systems have begun to incorporate alternative memory technologies to meet performance goals. Since these technologies present distinct advantages and tradeoffs compared to conventional DDR SDRAM, such as higher bandwidth with lower capacity or vice versa, they are typically packaged alongside conventional SDRAM in a heterogeneous memory architecture. To utilize the different types of memory efficiently, new data management strategies are needed to match application usage to the best available memory technology. However, current proposals for managing heterogeneous memories are limited because they either (1) do not consider high-level application behavior when assigning data to different types of memory, or (2) require a separate program execution (with a representative input) to collect information about how the application uses memory resources. This work presents a new data management toolset that addresses the limitations of existing approaches for managing complex memories. It extends the application runtime layer with automated monitoring and management routines that assign application data to the best tier of memory based on previous usage, without any need for source code modification or a separate profiling run. It evaluates this approach on a state-of-the-art server platform with both conventional DDR4 SDRAM and non-volatile Intel Optane DC memory, using memory-intensive high-performance computing (HPC) applications as well as standard benchmarks. Overall, the results show that this approach improves program performance significantly compared to a standard unguided approach across a variety of workloads and system configurations. The HPC applications exhibit the largest benefits, with speedups ranging from 1.4x to 7x in the best cases. Additionally, we show that this approach achieves performance similar to a comparable offline profiling-based approach after a short startup period, without requiring separate program execution or offline analysis steps.
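
To make the tiering idea concrete, here is a minimal sketch of a usage-guided placement policy: rank allocation sites by observed accesses per byte and fill the fast tier greedily. It is an illustration under assumed inputs, not the toolset described in the article, and all names are hypothetical.

```python
# Greedy hot/cold placement across a fast (DRAM) tier and a capacity tier.
def assign_tiers(sites, fast_capacity):
    """sites: list of dicts with 'name', 'bytes', 'accesses' gathered at runtime."""
    ranked = sorted(sites, key=lambda s: s["accesses"] / max(s["bytes"], 1), reverse=True)
    placement, used = {}, 0
    for site in ranked:
        if used + site["bytes"] <= fast_capacity:
            placement[site["name"]] = "DRAM"
            used += site["bytes"]
        else:
            placement[site["name"]] = "CAPACITY_TIER"
    return placement

# Example: a small, hot buffer wins the DRAM slot over a large, cold arena.
print(assign_tiers(
    [{"name": "hot_buffer", "bytes": 1 << 20, "accesses": 5_000_000},
     {"name": "cold_arena", "bytes": 1 << 30, "accesses": 10_000}],
    fast_capacity=1 << 29))
```
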
Bruno Chinelato Honorio, João Paulo Labegalini de Carvalho, Catalina Munoz Morales, Alexandro Baldassin, Guido Araujo
ACM Transactions on Architecture and Code Optimization; https://doi.org/10.1145/3533318

Abstract:
With chip manufacturers such as Intel, IBM, and ARM offering native support for transactional memory in their instruction set architectures, memory transactions are on the verge of being considered a genuine application tool rather than just an interesting research topic. Despite this recent increase in popularity on the hardware side of transactional memory (HTM), software support for transactional memory (STM) is still scarce, and the only compiler with transactional support currently available, the GNU Compiler Collection (GCC), does not generate code that achieves desirable performance. For hybrid TM solutions (HyTM), which are frameworks that leverage the best aspects of HTM and STM, the subpar performance of the software side, caused by inefficient compiler-generated code, may prevent HyTM from offering optimal results. This article extends previous work focused exclusively on STM implementations by presenting a detailed analysis of transactional code generated by GCC in the context of HyTM implementations. In particular, it builds on previous research on transactional memory support in the Clang/LLVM compiler framework, which is decoupled from any TM runtime, and presents the following novel contributions: (a) it shows that STM’s performance overhead, due to an excessive amount of read and write barriers added by the compiler, also impacts the performance of HyTM systems; (b) it reveals the importance of the previously proposed annotation mechanism in reducing the performance gap between HTM and STM in phased runtime systems. Furthermore, it shows that, by correctly using the annotations on just a few lines of code, it is possible to reduce the total number of instrumented barriers by 95% and to achieve speedups of up to 7x when compared to the original code generated by GCC and the Clang compiler.
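
The sketch below illustrates why eliding read and write barriers matters: every shared access in an STM path normally pays bookkeeping cost, and an annotation that marks an access as transaction-local lets it skip that cost. This is a toy runtime in Python with hypothetical names, not the paper's compiler annotation mechanism or GCC's TM runtime.

```python
# Toy STM that counts instrumented barriers; 'local=True' models the annotation.
class TinySTM:
    def __init__(self):
        self.read_set, self.write_log, self.barriers = set(), {}, 0

    def tx_read(self, addr, memory, local=False):
        if not local:
            self.barriers += 1             # instrumented read barrier + read-set tracking
            self.read_set.add(addr)
        return self.write_log.get(addr, memory[addr])

    def tx_write(self, addr, value, local=False):
        if not local:
            self.barriers += 1             # instrumented write barrier
        self.write_log[addr] = value       # buffered write, applied at commit

stm = TinySTM()
mem = {"x": 0, "tmp": 0}
stm.tx_write("tmp", 1, local=True)                  # annotated access: no barrier
stm.tx_write("x", stm.tx_read("x", mem) + 1)        # shared accesses keep their barriers
print(stm.barriers)                                 # 2 instrumented barriers instead of 3
```
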
Johnathan Alsop, Weon Taek Na, Matthew D. Sinclair, Samuel Grayson, Sarita Adve
ACM Transactions on Architecture and Code Optimization; https://doi.org/10.1145/3530819

Abstract:
Hardware specialization is becoming a key enabler of energy-efficient performance. Future systems will be increasingly heterogeneous, integrating multiple specialized and programmable accelerators, each with different memory demands. Traditionally, communication between accelerators has been inefficient, typically orchestrated through explicit DMA transfers between different address spaces. More recently, industry has proposed unified coherent memory, which enables implicit data movement and more data reuse, but these interfaces often limit the coherence flexibility available to heterogeneous systems. This paper demonstrates the benefits of fine-grained coherence specialization for heterogeneous systems. We propose an architecture that enables low-complexity, independent specialization of each individual coherence request in heterogeneous workloads by building upon a simple and flexible baseline coherence interface, Spandex. We then describe how to optimize individual memory requests to improve cache reuse and performance-critical memory latency in emerging heterogeneous workloads. Collectively, our techniques enable significant gains, reducing execution time by up to 61% or network traffic by up to 99% while adding minimal complexity to the Spandex protocol.
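
As a loose illustration of per-request specialization, the sketch below chooses a coherence flavor for each individual access based on expected reuse and producer-consumer behavior. The request labels are illustrative stand-ins rather than the exact Spandex message types, and the heuristic is an assumption made for the example.

```python
# Per-access coherence choice: each request picks its own flavor instead of
# the whole device being locked into one fixed coherence policy.
def choose_request(kind, expected_reuse, produced_for_other_device):
    if kind == "load":
        # High-reuse data is worth registering/caching; streaming data is better
        # fetched with a lightweight, self-invalidating read.
        return "cache-and-register" if expected_reuse > 1 else "self-invalidating-read"
    else:  # store
        # Data produced for a consumer on another device can be pushed toward it;
        # privately reused data is kept in an owned state locally.
        return "write-through-to-shared" if produced_for_other_device else "obtain-ownership"

print(choose_request("load", expected_reuse=4, produced_for_other_device=False))
print(choose_request("store", expected_reuse=1, produced_for_other_device=True))
```
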
Mohammadreza Soltaniyeh, Richard P. Martin, et al.
ACM Transactions on Architecture and Code Optimization; https://doi.org/10.1145/3532863

Abstract:
This paper proposes a novel hardware accelerator for the inference task with sparse convolutional neural networks (CNNs) by building a hardware unit to perform Image to Column (Im2Col) transformation of the input feature map, coupled with a systolic array-based general matrix-matrix multiplication (GEMM) unit. Our design carefully overlaps the Im2Col transformation with the GEMM computation to maximize parallelism. We propose a novel design for the Im2Col unit that uses a set of distributed local memories connected by a ring network, which improves energy efficiency and latency by streaming the input feature map only once. The systolic array-based GEMM unit in the accelerator can be dynamically configured as multiple GEMM units with square-shaped systolic arrays or as a single GEMM unit with a tall systolic array. This dynamic reconfigurability enables effective pipelining of Im2Col and GEMM operations and attains high processing element utilization for a wide range of CNNs. Further, our accelerator is sparsity-aware, improving performance and energy efficiency by effectively mapping the sparse feature maps and weights to the processing elements, skipping ineffectual operations and unnecessary data movements involving zeros. Our prototype, SPOTS, is on average 2.16×, 1.74×, and 1.63× faster than Gemmini, Eyeriss, and Sparse-PE, respectively, which are prior hardware accelerators for dense and sparse CNNs. SPOTS is also 78× and 12× more energy-efficient than CPU and GPU implementations, respectively.
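
For readers unfamiliar with the transformation, here is a standard Im2Col reference in NumPy, shown only to make the abstract's terminology concrete; SPOTS implements this in hardware with distributed local memories and overlaps it with the systolic GEMM, which this sketch does not model.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """x: input feature map of shape (C, H, W). Returns a (C*kh*kw, P) matrix whose
    columns are flattened receptive fields, so convolution becomes one GEMM."""
    C, H, W = x.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
    p = 0
    for i in range(0, H - kh + 1, stride):
        for j in range(0, W - kw + 1, stride):
            cols[:, p] = x[:, i:i + kh, j:j + kw].ravel()
            p += 1
    return cols

# Convolution as GEMM: reshape the weights to (num_filters, C*kh*kw) and multiply.
# out = weights.reshape(F, -1) @ im2col(x, kh, kw)   ->  shape (F, out_h*out_w)
```
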
David Corbalán-Navarro, Juan L. Aragón, Martí Anglada, Joan-Manuel Parcerisa, Antonio González
ACM Transactions on Architecture and Code Optimization; https://doi.org/10.1145/3527861

Abstract:
This paper proposes a novel micro-architecture approach for mobile GPUs that removes occluded geometry from a scene early by leveraging frame-to-frame coherence, thus reducing overall energy consumption. Mobile GPUs commonly implement a Tile-Based Rendering (TBR) architecture, which differentiates two main phases: the Geometry Pipeline, where all the geometry of a scene is processed, and the Raster Pipeline, where primitives are rendered into a framebuffer. After the Geometry Pipeline, only non-culled primitives inside the camera’s frustum are stored into the Parameter Buffer, a data structure stored in DRAM. However, a significant fraction of the non-culled primitives are rendered but not visible at all, resulting in useless computation. On average, 60% of those primitives are completely occluded in our benchmarks. Although TBR architectures use on-chip caches for the Parameter Buffer, about 46% of the DRAM traffic still comes from accesses to that buffer. The proposed Triangle Dropping technique leverages the visibility information computed along the Raster Pipeline to predict the primitives’ visibility in the next frame and to discard, early on, those that will be totally occluded, drastically reducing Parameter Buffer accesses. On average, our approach achieves overall energy savings of 14.5%, energy-delay product savings of 28.2%, and a speedup of 20.2%.
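
A minimal sketch of the frame-to-frame prediction step follows, with hypothetical names; the real Triangle Dropping logic runs inside the GPU's Geometry Pipeline and uses visibility results produced by the previous frame's Raster Pipeline, and mispredictions have to be handled by a recovery mechanism this sketch omits.

```python
# Drop primitives that were fully occluded in the previous frame so they never
# reach the Parameter Buffer; unknown (new) primitives are conservatively kept.
def drop_occluded(primitives, visible_last_frame):
    kept, dropped = [], []
    for prim in primitives:
        if visible_last_frame.get(prim["id"], True):
            kept.append(prim)
        else:
            dropped.append(prim)
    return kept, dropped

prims = [{"id": 1}, {"id": 2}, {"id": 3}]
# Primitive 2 was fully occluded last frame; primitive 3 is new and is kept.
print(drop_occluded(prims, {1: True, 2: False}))
```
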
Horng-Ruey Huang, Ding-Yong Hong, Jan-Jan Wu, Kung-Fu Chen, Pangfeng Liu, Wei-Chung Hsu
ACM Transactions on Architecture and Code Optimization; https://doi.org/10.1145/3527609

Abstract:
Video captioning is a core technology for many important applications such as AI-assisted medical diagnosis, video question answering, storytelling through videos, and lip-reading. Video captioning employs a hybrid CNN+RNN model. Accelerating such a hybrid model on a heterogeneous system is challenging because (1) the CNN and the RNN exhibit very different computing behaviors, making the mapping between computation and heterogeneous devices difficult, and (2) data dependencies exist between the CNN and the RNN within a video frame and between adjacent RNNs across video frames. These data dependencies prohibit full parallelization of the hybrid model. A further issue is the utilization of accelerator resources, which is critical to maximizing performance. In this work, we propose a fine-grained scheduling scheme for mapping computation to devices within a video frame, and a pipeline scheduling scheme for exploiting maximum parallelism between the execution of video frames. In addition, we propose two capacity-guided scheduling methods. On the server, the concurrent kernel execution mechanism is exploited to improve GPU utilization. On the edge platform, we re-arrange CNN computation among the CPU and EdgeTPUs, guided by the EdgeTPU’s SRAM capacity, so that balanced computation is achieved and off-chip memory overhead is minimized. Experimental results show that our scheduling scheme improves video captioning performance by up to 3.24× with CPU+GPU collaboration over GPU-only execution. On an edge platform with an ARM CPU and two EdgeTPUs, our CPU+EdgeTPU scheduling achieves up to a 54.9× speedup compared to using the ARM CPU alone and can perform video captioning at 59 frames per second.
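
The pipeline idea in the abstract can be illustrated with a simple schedule: the CNN for frame t+1 runs on one device while the RNN for frame t runs on another, which respects both the CNN-to-RNN dependency within a frame and the RNN-to-RNN dependency across frames. This is a hypothetical sketch of that overlap, not the paper's scheduler.

```python
# Print a two-stage pipeline schedule over a sequence of frames.
def pipeline_schedule(num_frames):
    steps = []
    for t in range(num_frames + 1):
        slot = []
        if t < num_frames:
            slot.append(("device_A", f"CNN(frame {t})"))
        if t >= 1:
            slot.append(("device_B", f"RNN(frame {t - 1})"))  # needs CNN(t-1) and RNN(t-2)
        steps.append((t, slot))
    return steps

for t, slot in pipeline_schedule(4):
    print(t, slot)
```
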
Cesar Gomes, Maziar Amiraski, Mark Hempstead
ACM Transactions on Architecture and Code Optimization, Volume 19, pp 1-27; https://doi.org/10.1145/3494538

Abstract:
Cache management policies should consider workloads’ contention behavior when managing a shared cache. Prior art estimates shared cache behavior by adding extra logic or time to isolate per-workload cache statistics. These approaches provide per-workload analysis but do not provide a holistic understanding of the utilization and effectiveness of caches under the ever-growing contention that comes with scaling core counts. We present Contention Analysis in Shared Hierarchies using Thefts, or CASHT, a framework for capturing cache contention information both offline and online. CASHT takes advantage of cache statistics made richer by observing a consequence of cache contention: inter-core evictions, or what we call thefts. We use thefts to complement more familiar cache statistics to train a learning model based on gradient-boosted trees (GBT) to predict the best way to partition the last-level cache. GBT achieves over 90% accuracy with trained models as small as 100 bytes, and at least 95% accuracy at a 1 kB model size, when predicting the best way to partition the cache between two workloads. CASHT employs a novel run-time framework for collecting theft-based metrics despite partition intervention, and enables per-access sampling rather than set sampling, which could add overhead but may not capture true workload behavior. Coupling CASHT and GBT as a dynamic policy results in a very lightweight dynamic partitioning scheme that performs within a margin of error of Utility-based Cache Partitioning at 1/8 the overhead.
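
The learning step can be illustrated with a gradient-boosted tree trained on per-workload cache statistics, including theft counts, to pick a partition. The feature layout and the synthetic labels below are hypothetical stand-ins, not CASHT's actual inputs or model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Features per co-running pair: [accesses_A, misses_A, thefts_A, accesses_B, misses_B, thefts_B]
X = rng.random((500, 6))
# Toy label: favor whichever workload suffers more thefts per access.
y = (X[:, 2] / (X[:, 0] + 1e-9) > X[:, 5] / (X[:, 3] + 1e-9)).astype(int)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)
print("favor workload", "A" if model.predict(X[:1])[0] else "B")
```
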
Xiaoshe Dong, Longxiang Wang, Weiduo Chen, Xingjun Zhang
ACM Transactions on Architecture and Code Optimization, Volume 19, pp 1-24; https://doi.org/10.1145/3500917

Abstract:
In recent years, research on disk fault detection based on SMART data combined with different machine learning algorithms has proven effective. However, these methods require a large amount of data. In the early stages of establishing a data center or deploying new storage devices, the amount of disk reliability data is relatively limited, and the amount of failed-disk data is even smaller, resulting in unsatisfactory detection performance for machine learning algorithms. To solve these problems, we propose a novel small-sample disk fault detection (SSDFD) optimization method based on Generative Adversarial Networks (GANs). Combined with the characteristics of hard disk reliability data, the generator of the original GAN is improved based on Long Short-Term Memory (LSTM), making it suitable for generating failed-disk data. To alleviate the data imbalance problem and expand the failed-disk dataset from a reduced amount of original data, the proposed model is trained through adversarial training that focuses on the generation of failed-disk data. Experimental results on real HDD datasets show that SSDFD can generate enough virtual failed-disk data to enable machine learning algorithms to detect disk faults with increased accuracy when only a few original failed-disk samples are available. Furthermore, a model trained with 300 original failed-disk samples significantly improves the accuracy of HDD fault detection. The optimal amount of generated virtual data is 20–30 times that of the original data.
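
A minimal PyTorch sketch of an LSTM-based generator for SMART-like sequences is shown below, assuming the standard GAN setup the abstract describes; the layer sizes, noise dimension, and attribute count are arbitrary choices for illustration, not the paper's configuration, and the discriminator and training loop are omitted.

```python
import torch
import torch.nn as nn

class LSTMGenerator(nn.Module):
    """Maps a noise sequence to a synthetic sequence of per-day SMART attributes."""
    def __init__(self, noise_dim=32, hidden_dim=64, n_smart_attrs=12):
        super().__init__()
        self.lstm = nn.LSTM(noise_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_smart_attrs)

    def forward(self, z):                    # z: (batch, seq_len, noise_dim)
        h, _ = self.lstm(z)
        return self.out(h)                   # fake SMART attributes per timestep

gen = LSTMGenerator()
fake = gen(torch.randn(8, 14, 32))           # 8 synthetic failed-disk traces, 14 days each
print(fake.shape)                             # torch.Size([8, 14, 12])
```
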
Dennis Rieber, Axel Acosta, Holger Fröning
ACM Transactions on Architecture and Code Optimization, Volume 19, pp 1-26; https://doi.org/10.1145/3487922

Abstract:
The success of Deep Artificial Neural Networks (DNNs) in many domains has created a rich body of research concerned with hardware accelerators for compute-intensive DNN operators. However, implementing such operators efficiently with complex hardware intrinsics such as matrix multiply is a task not yet gracefully automated. Solving this task often requires joint program and data layout transformations. First solutions to this problem have been proposed, such as TVM, UNIT, or ISAMIR, which work on a loop-level representation of operators and specify data layout and possible program transformations before the embedding into the operator is performed. This top-down approach creates a tension between exploration range and search space complexity, especially when also exploring data layout transformations such as im2col, channel packing, or padding. In this work, we propose a new approach to this problem. We created a bottom-up method that allows the joint transformation of both computation and data layout based on the found embedding. By formulating the embedding as a constraint satisfaction problem over the scalar dataflow, every possible embedding solution is contained in the search space. Adding additional constraints and optimization targets to the solver generates the subset of preferable solutions. An evaluation using the VTA hardware accelerator with the Baidu DeepBench inference benchmark shows that our approach can automatically generate code competitive with reference implementations. Further, we show that dynamically determining the data layout based on the intrinsic and the workload is beneficial for hardware utilization and performance. In cases where the reference implementation has low hardware utilization due to its fixed deployment strategy, we achieve a geomean speedup of up to 2.813×, while individual operators can improve by as much as 170×.
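
As a toy illustration of the bottom-up idea (far simpler than the paper's scalar-dataflow formulation, and with hypothetical names), the sketch below enumerates assignments of workload axes to the axes of a fixed matrix-multiply intrinsic, keeps only assignments that satisfy the intrinsic's constraints, and reads the implied data layout off each surviving solution.

```python
from itertools import permutations

INTRINSIC_AXES = {"m": 16, "n": 16, "k": 16}          # fixed hardware tile shape

def embeddings(workload_axes):
    """workload_axes: dict axis_name -> extent, e.g. the axes of a lowered operator."""
    names = list(workload_axes)
    for perm in permutations(names, len(INTRINSIC_AXES)):
        candidate = dict(zip(INTRINSIC_AXES, perm))
        # Constraint: each mapped workload axis must tile evenly onto the intrinsic axis.
        if all(workload_axes[candidate[a]] % INTRINSIC_AXES[a] == 0 for a in INTRINSIC_AXES):
            yield candidate                           # also determines the data layout

for sol in embeddings({"batch": 32, "out_ch": 64, "in_ch": 48, "pixels": 4096}):
    print(sol)
```
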