Results in Journal ACM Transactions on Architecture and Code Optimization: 767

Michael Stokes, David Whalley, Soner Onder
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-22; https://doi.org/10.1145/3449043

Abstract:
While data filter caches (DFCs) have been shown to be effective at reducing data access energy, they have not been adopted in processors due to the associated performance penalty caused by high DFC miss rates. In this article, we present a design that both decreases the DFC miss rate and completely eliminates the DFC performance penalty even for a level-one data cache (L1 DC) with a single cycle access time. First, we show that a DFC that lazily fills each word in a DFC line from an L1 DC only when the word is referenced is more energy-efficient than eagerly filling the entire DFC line. For a 512B DFC, we are able to eliminate loads of words into the DFC that are never referenced before being evicted, which occurred for about 75% of the words in 32B lines. Second, we demonstrate that a lazily word filled DFC line can effectively share and pack data words from multiple L1 DC lines to lower the DFC miss rate. For a 512B DFC, we completely avoid accessing the L1 DC for loads about 23% of the time and avoid a fully associative L1 DC access for loads 50% of the time, where the DFC only requires about 2.5% of the size of the L1 DC. Finally, we present a method that completely eliminates the DFC performance penalty by speculatively performing DFC tag checks early and only accessing DFC data when a hit is guaranteed. For a 512B DFC, we improve data access energy usage for the DTLB and L1 DC by 33% with no performance degradation.
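A minimal trace-driven sketch of the lazy-fill idea (not the authors' hardware or simulator; the line size, DFC size, and toy word-address trace are assumptions) contrasting eager line fill with filling each word only on first reference:

# Hypothetical sketch: contrast eager line fill with the lazy per-word fill
# described above, for a tiny direct-mapped data filter cache (DFC).
LINE_WORDS = 8          # 32B line / 4B words
DFC_LINES = 16          # 512B DFC

def simulate(trace, lazy=True):
    """trace: iterable of word addresses. Returns words loaded from the L1 DC."""
    tags = [None] * DFC_LINES                   # line tag per DFC set
    valid = [set() for _ in range(DFC_LINES)]   # which words of the line are filled
    loads_from_l1 = 0
    for addr in trace:
        line, word = addr // LINE_WORDS, addr % LINE_WORDS
        s = line % DFC_LINES
        if tags[s] != line:                     # DFC miss: (re)allocate the line
            tags[s] = line
            valid[s] = set() if lazy else set(range(LINE_WORDS))
            loads_from_l1 += 0 if lazy else LINE_WORDS
        if lazy and word not in valid[s]:
            valid[s].add(word)                  # fill only the referenced word
            loads_from_l1 += 1
    return loads_from_l1

trace = [0, 1, 0, 64, 65, 0, 1]                 # toy word-address trace
print(simulate(trace, lazy=False), simulate(trace, lazy=True))   # 16 4
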
Yashuai Lü, Hui Guo, Libo Huang, Qi Yu, Li Shen, Nong Xiao, Zhiying Wang
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-24; https://doi.org/10.1145/3450440

Abstract:
Due to massive thread-level parallelism, GPUs have become an attractive platform for accelerating large-scale data parallel computations, such as graph processing. However, achieving high performance for graph processing with GPUs is non-trivial. Processing graphs on GPUs introduces several problems, such as load imbalance, low utilization of hardware units, and memory divergence. Although previous work has proposed several software strategies to optimize graph processing on GPUs, there are several issues beyond the capability of software techniques to address. In this article, we present GraphPEG, a graph processing engine for efficient graph processing on GPUs. Inspired by the observation that many graph algorithms have a common pattern of graph traversal, GraphPEG improves the performance of graph processing by coupling automatic edge gathering with fine-grain work distribution. GraphPEG can also adapt to various input graph datasets and simplify the software design of graph processing with hardware-assisted graph traversal. Simulation results show that, in comparison with two representative, highly efficient GPU graph processing software frameworks, Gunrock and SEP-Graph, GraphPEG improves graph processing throughput by 2.8× and 2.5× on average, and up to 7.3× and 7.0×, for six graph algorithm benchmarks on six graph datasets, with marginal hardware cost.
Shoaib Akram
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3451342

Abstract:
Intel Optane memory offers non-volatility, byte addressability, and high capacity. It suits managed workloads that prefer large main memory heaps. We investigate Optane as the main memory for managed (Java) workloads, focusing on performance scalability. As the workload scales to more cores, we track Optane’s performance relative to DRAM. A few workloads incur only a slight slowdown on Optane memory, which helps conserve limited DRAM capacity. Unfortunately, other workloads scale poorly beyond a few cores. This article investigates scaling bottlenecks for Java workloads on Optane memory, analyzing the application, runtime, and microarchitectural interactions. Poorly scaling workloads allocate objects rapidly and access objects in Optane memory frequently. These characteristics slow down the mutator and substantially slow down garbage collection (GC). At the microarchitecture level, load, store, and instruction miss penalties rise. To regain performance, we partition heaps across DRAM and Optane memory, a hybrid that scales considerably better than Optane alone. We exploit state-of-the-art GC approaches to partition heaps. Unfortunately, existing GC approaches needlessly waste DRAM capacity because they ignore runtime behavior. This article also introduces performance impact-guided memory allocation (PIMA) for hybrid memories. PIMA maximizes Optane utilization, allocating in DRAM only if it improves performance. It estimates the performance impact of allocating heaps in either memory type by sampling. We target PIMA at graph analytics workloads, offering a novel performance estimation method and detailed evaluation. PIMA identifies workload phases that benefit from DRAM with high (94.33%) accuracy, incurring only a 2% sampling overhead. PIMA operates stand-alone or combines with prior approaches to offer new performance versus DRAM capacity trade-offs. This work opens up Optane memory to a real-life role as the main memory for Java workloads.
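A hypothetical sketch of a PIMA-style decision rule, with made-up throughput numbers, sample counts, and threshold, illustrating the sampling-based choice between Optane and DRAM described above:

# Hypothetical sketch (not PIMA itself): sample a workload phase in each memory
# tier and keep the heap in Optane unless DRAM clearly improves performance.
import random

def sample_throughput(tier):
    # Stand-in for online sampling; in reality this would time a short
    # execution window with the heap (or nursery) placed in `tier`.
    base = {"optane": 100.0, "dram": 118.0}[tier]
    return base * random.uniform(0.97, 1.03)

def choose_tier(threshold=0.10, samples=5):
    opt = sum(sample_throughput("optane") for _ in range(samples)) / samples
    dram = sum(sample_throughput("dram") for _ in range(samples)) / samples
    # Spend scarce DRAM capacity only if the sampled speedup exceeds the threshold.
    return "dram" if (dram - opt) / opt > threshold else "optane"

print(choose_tier())
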
Ricardo Alves, Stefanos Kaxiras, David Black-Schaffer
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-22; https://doi.org/10.1145/3458883

Abstract:
Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead. In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead. Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.
Mustafa Cavus, Mohammed Shatnawi, Resit Sendag, Augustus K. Uht
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3452099

Abstract:
Lookup operations for in-memory databases are heavily memory bound, because they often rely on pointer-chasing linked data structure traversals. They also have many branches that are hard to predict due to random key lookups. In this study, we show that, although cache misses are the primary bottleneck for these applications, only a small fraction of the potential performance benefit is achieved through prefetching alone unless branch mispredictions are also eliminated. We propose the Node Tracker (NT), a novel programmable prefetcher/pre-execution unit that is highly effective in exploiting inter-key-lookup parallelism to improve single-thread performance. We extend NT with branch outcome streaming (BOS) to reduce branch mispredictions and show that this achieves an extra 3× speedup. We also evaluate the NT as a pre-execution unit and demonstrate that we can further improve the performance in both single- and multi-threaded execution modes. Our results show that, on average, NT improves single-thread performance by 4.1× when used as a prefetcher; 11.9× as a prefetcher with BOS; 14.9× as a pre-execution unit; and 18.8× as a pre-execution unit with BOS. Finally, with 24 cores of the latter version, we achieve a speedup of 203× and 11× over the single-core and 24-core baselines, respectively.
Daniel Rodrigues Carvalho, André Seznec
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-27; https://doi.org/10.1145/3457207

Abstract:
Hardware cache compression derives from software-compression research; yet, its implementation is not a straightforward translation, since it must abide by multiple restrictions to comply with area, power, and latency constraints. This study sheds light on the challenges of adopting compression in cache design—from the shrinking of the data until its physical placement. The goal of this article is not to summarize proposals but to put in evidence the solutions they employ to handle those challenges. An in-depth description of the main characteristics of multiple methods is provided, as well as criteria that can be used as a basis for the assessment of such schemes. It is expected that this article will ease the understanding of decisions to be taken for the design of compressed systems and provide directions for future work.
Weijia Song, Christina Delimitrou, Zhiming Shen, Robbert Van Renesse, Hakim Weatherspoon, Lotfi Benmohamed, Frederic De Vaulx, Charif Mahmoudi
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3457373

Abstract:
Infrastructure-as-a-Service cloud providers sell virtual machines that are only specified in terms of number of CPU cores, amount of memory, and I/O throughput. Performance-critical aspects such as cache sizes and memory latency are missing or reported in ways that make them hard to compare across cloud providers. It is difficult for users to adapt their application’s behavior to the available resources. In this work, we aim to increase the visibility that cloud users have into shared resources on public clouds. Specifically, we present CacheInspector, a lightweight runtime that determines the performance and allocated capacity of shared caches on multi-tenant public clouds. We validate CacheInspector’s accuracy in a controlled environment, and use it to study the characteristics and variability of cache resources in the cloud, across time, instances, availability regions, and cloud providers. We show that CacheInspector’s output allows cloud users to tailor their application’s behavior, including their output quality, to avoid suboptimal performance when resources are scarce.
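The sketch below is only illustrative of the measurement shape (the real CacheInspector runtime uses native code and fine-grained timers, and Python timing is noisy): it times a dependent pointer chase over growing working sets, where latency steps suggest the capacity of each cache level:

# Illustrative only: time a random, dependent pointer chase over growing
# working sets and look for latency steps near each cache level's capacity.
import random, time

def cyclic_permutation(n):
    # Sattolo's algorithm: a single-cycle permutation, so the chase visits
    # every element instead of getting stuck in a short cycle.
    p = list(range(n))
    for i in range(n - 1, 0, -1):
        j = random.randrange(i)
        p[i], p[j] = p[j], p[i]
    return p

def ns_per_access(n_elems, iters=200_000):
    nxt = cyclic_permutation(n_elems)
    i, t0 = 0, time.perf_counter()
    for _ in range(iters):
        i = nxt[i]                      # dependent loads defeat prefetching
    return (time.perf_counter() - t0) / iters * 1e9

for kib in (16, 64, 256, 1024, 4096, 16384):   # assumed working-set sweep
    print(f"{kib:>6} KiB: {ns_per_access(kib * 1024 // 8):6.1f} ns/access")
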
Jose M. Rodriguez Borbon, Junjie Huang, Bryan M. Wong, Walid Najjar
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3447775

Abstract:
QR decomposition is one of the most useful factorization kernels in modern numerical linear algebra algorithms. In particular, the decomposition of tall-and-skinny matrices (TSMs) has major applications in areas including scientific computing, machine learning, image processing, wireless networks, and numerical methods. Traditionally, CPUs and GPUs have achieved better throughput on these applications by using large cache hierarchies and compute cores running at a high frequency, leading to high power consumption. With the advent of heterogeneous platforms, however, FPGAs are emerging as a promising viable alternative. In this work, we propose a high-throughput FPGA-based engine that has a very high computational efficiency (ratio of achieved to peak throughput) compared to similar QR solvers running on FPGAs. Although comparable QR solvers achieve an efficiency of 36%, our design exhibits an efficiency of 54%. For TSMs, our experimental results show that our design can outperform highly optimized QR solvers running on CPUs and GPUs. For TSMs with more than 50K rows, our design outperforms the Intel MKL solver running on an Intel quad-core processor by a factor of 1.5×. For TSMs containing 256 columns or less, our design outperforms the NVIDIA CUBLAS solver running on a K40 GPU by a factor of 3.0×. In addition to being fast, our design is energy efficient—competing platforms execute up to 0.6 GFLOPS/Joule, whereas our design executes more than 1.0 GFLOPS/Joule.
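For context, a minimal NumPy sketch of the standard blocked tall-and-skinny QR (TSQR) scheme that TSM solvers typically build on; this is not the authors' FPGA engine, and the block size and matrix shape below are arbitrary:

# TSQR sketch: factor row blocks independently, then factor the stacked R's.
import numpy as np

def tsqr(A, block_rows=1024):
    m, n = A.shape
    blocks = [A[i:i + block_rows] for i in range(0, m, block_rows)]
    Qs, Rs = zip(*(np.linalg.qr(B, mode="reduced") for B in blocks))
    # Second stage: factor the stacked n-by-n R blocks.
    Q2, R = np.linalg.qr(np.vstack(Rs), mode="reduced")
    # Combine: Q is block-diagonal(Q_i) times Q2, applied block by block.
    Q = np.vstack([Qi @ Q2[i * n:(i + 1) * n] for i, Qi in enumerate(Qs)])
    return Q, R

A = np.random.rand(50_000, 64)          # tall-and-skinny: 50K rows, 64 columns
Q, R = tsqr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(64)))   # True True
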
João P. L. De Carvalho, Braedy Kuzma, Ivan Korostelev, José Nelson Amaral, Christopher Barton, José Moreira, Guido Araujo
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-22; https://doi.org/10.1145/3459010

Abstract:
Well-crafted libraries deliver much higher performance than code generated by sophisticated application programmers using advanced optimizing compilers. When a code pattern for which a well-tuned library implementation exists is found in the source code of an application, the highest performing solution is to replace the pattern with a call to the library. Idiom-recognition solutions in the past either required pattern matching machinery that was outside of the compilation framework or provided a very brittle solution that would fail even for minor variants in the pattern source code. This article introduces Kernel Find & Replacer (KernelFaRer), an idiom recognizer implemented entirely in the existing LLVM compiler framework. The versatility of KernelFaRer is demonstrated by matching and replacing two linear algebra idioms: general matrix-matrix multiplication (GEMM) and symmetric rank-2k update (SYR2K). Both GEMM and SYR2K are used extensively in scientific computation, and GEMM is also a central building block for deep learning and computer graphics algorithms. The idiom recognition in KernelFaRer is much more robust than alternative solutions, has a much lower compilation overhead, and is fully integrated in the broadly used LLVM compilation tools. KernelFaRer replaces existing GEMM and SYR2K idioms with computations performed by BLAS, Eigen, MKL (Intel’s x86), ESSL (IBM’s PowerPC), and BLIS (AMD). Gains in performance that reach 2000× over hand-crafted source code compiled at the highest optimization level demonstrate that replacing application code with a library call is a performant solution.
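A toy illustration of the kind of transformation KernelFaRer performs inside LLVM, with NumPy's BLAS-backed matmul standing in for the BLAS/Eigen/MKL/ESSL/BLIS routine that replaces the recognized loop nest:

# The source-level GEMM idiom and the call it is rewritten into.
import numpy as np

def gemm_idiom(A, B, C, alpha=1.0, beta=0.0):
    """The idiom as a programmer might write it: C = alpha*A@B + beta*C via loops."""
    M, K = A.shape
    _, N = B.shape
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]
            C[i, j] = alpha * acc + beta * C[i, j]
    return C

def gemm_replaced(A, B, C, alpha=1.0, beta=0.0):
    """What the idiom becomes: a single call to a tuned GEMM routine."""
    return alpha * (A @ B) + beta * C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
C = np.zeros((64, 64))
print(np.allclose(gemm_idiom(A, B, C.copy()), gemm_replaced(A, B, C)))   # True
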
Daniel Thuerck, Nicolas Weber, Roberto Bifulco
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3458357

Abstract:
A large portion of the recent performance increase in the High Performance Computing (HPC) and Machine Learning (ML) domains is fueled by accelerator cards. Many popular ML frameworks support accelerators by organizing computations as a computational graph over a set of highly optimized, batched general-purpose kernels. While this approach simplifies the kernels’ implementation for each individual accelerator, the increasing heterogeneity among accelerator architectures for HPC complicates the creation of portable and extensible libraries of such kernels. Therefore, using a generalization of the CUDA community’s warp register cache programming idiom, we propose a new programming idiom (CoRe) and a virtual architecture model (PIRCH), abstracting over SIMD and SIMT paradigms. We define and automate the mapping process from a single source to PIRCH’s intermediate representation and develop backends that issue code for three different architectures: Intel AVX512, NVIDIA GPUs, and NEC SX-Aurora. Code generated by our source-to-source compiler for batched kernels, borG, competes favorably with vendor-tuned libraries and is up to 2× faster than hand-tuned kernels across architectures.
George Charitopoulos, Dionisios N. Pnevmatikatos, Georgi Gaydadjiev
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3447970

Abstract:
Executing complex scientific applications on Coarse-Grain Reconfigurable Arrays (CGRAs) promises improvements in execution time and/or energy consumption compared to optimized software implementations or even fully customized hardware solutions. Typical CGRA architectures contain multiple instances of the same compute module, which consists of simple and general hardware units such as ALUs and simple processors. However, generality in the cell contents, while convenient for serving a wide variety of applications, penalizes performance and energy efficiency. To that end, a few proposed CGRAs use custom logic tailored to a particular application’s specific characteristics in the compute module. This approach, while much more efficient, restricts the versatility of the array. To date, versatility at hardware speeds is only supported by Field-Programmable Gate Arrays (FPGAs), which are reconfigurable at a very fine grain. This work proposes MC-DeF, a novel Mixed-CGRA Definition Framework targeting a Mixed-CGRA architecture that leverages the advantages of CGRAs by utilizing a customized cell array, and those of FPGAs by incorporating a separate LUT array used for adaptability. The framework presented aims to develop a complete CGRA architecture. First, a cell structure and functionality definition phase creates highly customized application/domain-specific CGRA cells. Then, mapping and routing phases define the CGRA connectivity and cell-LUT array transactions. Finally, an energy and area estimation phase presents the user with area occupancy and energy consumption estimations of the final design. MC-DeF uses novel algorithms and cost functions driven by user-defined metrics, threshold values, and area/energy restrictions. The benefits of our framework, besides creating fast and efficient CGRA designs, include design space exploration capabilities offered to the user. The validity of the presented framework is demonstrated by evaluating and creating CGRA designs of nine applications. Additionally, we provide comparisons of MC-DeF with state-of-the-art related works, and show that MC-DeF offers competitive performance (in terms of internal bandwidth and processing throughput) even compared against much larger designs, and requires fewer physical resources to achieve this level of performance. Finally, MC-DeF is able to better utilize the underlying FPGA fabric and achieves the best efficiency (measured in LUT/GOPs).
Hamza Omar, Omer Khan
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3450523

Abstract:
Multicores increasingly deploy safety-critical parallel applications that demand resiliency against soft-errors to satisfy the safety standards. However, protection against these errors is challenging due to complex communication and data access protocols that aggressively share on-chip hardware resources. Research has explored various temporal and spatial redundancy-based resiliency schemes that provide multicores with high soft-error coverage. However, redundant execution incurs performance overheads due to interference effects induced by aggressive resource sharing. Moreover, these schemes require intrusive hardware modifications and fall short in providing efficient system availability guarantees. This article proposes PRISM, a resilient multicore architecture that incorporates strong hardware isolation to form redundant clusters of cores, ensuring a non-interference-based redundant execution environment. A soft error in one cluster does not affect the execution of the other cluster, resulting in high system availability. Implementing strong isolation for shared hardware resources, such as queues, caches, and networks, requires logic for partitioning. However, it is less intrusive as complex hardware modifications to protocols, such as hardware cache coherence, are avoided. The PRISM approach is prototyped on a real Tilera Tile-Gx72 processor that enables primitives to implement the proposed cluster-level hardware resource isolation. The evaluation shows performance benefits from avoiding destructive hardware interference effects with redundant execution, while delivering superior system availability.
Wim Heirman, Stijn Eyerman, Kristof Du Bois, Ibrahim Hur
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-23; https://doi.org/10.1145/3452141

Abstract:
Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity make caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (subline access), can solve these issues. Deciding which accesses to handle as sparse accesses and which as regular cached accesses is a challenging task, with a large potential impact on performance. Not only is performance reduced by treating sparse accesses as regular accesses; not caching accesses that do have locality also negatively impacts performance by significantly increasing their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal. We propose the Instruction Spatial Locality Estimator (ISLE), a hardware detector that finds instructions that access isolated words in a sea of unused data. These sparse accesses are dynamically converted into uncached subline accesses, while keeping regular accesses cached. ISLE does not require modifying source code or binaries, and adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads, and show that ISLE outperforms no subline accesses, manual sublining, and prior work on detecting sparse accesses.
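A hypothetical software model of such a detector (word/line sizes, the threshold, and the addresses are invented): per load PC, it tracks how many distinct words of each fetched line are actually touched and flags PCs that use only isolated words:

# Toy spatial-locality estimator: flag PCs whose fetched lines are barely used.
from collections import defaultdict

LINE_BYTES, WORD_BYTES, SPARSE_THRESHOLD = 64, 8, 1.5

class LocalityEstimator:
    def __init__(self):
        self.words_per_line = defaultdict(set)   # (pc, line) -> touched words

    def observe(self, pc, addr):
        line, word = addr // LINE_BYTES, (addr % LINE_BYTES) // WORD_BYTES
        self.words_per_line[(pc, line)].add(word)

    def is_sparse(self, pc):
        used = [len(w) for (p, _), w in self.words_per_line.items() if p == pc]
        return bool(used) and sum(used) / len(used) < SPARSE_THRESHOLD

est = LocalityEstimator()
for i in range(0, 4096, 512):            # strided, spatially sparse access stream
    est.observe(pc=0x400a10, addr=i)
for i in range(0, 256, 8):               # dense, sequential access stream
    est.observe(pc=0x400b20, addr=i)
print(est.is_sparse(0x400a10), est.is_sparse(0x400b20))   # True False
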
Sugandha Tiwari, Neel Gala, Chester Rebeiro, V. Kamakoti
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3446210

Abstract:
Owing to the failure of Dennard scaling, the past decade has seen a steep growth of prominent new paradigms leveraging opportunities in computer architecture. Two technologies of interest are Posit and RISC-V. Posit was introduced in mid-2017 as a viable alternative to IEEE-754, and RISC-V provides a commercial-grade open source Instruction Set Architecture (ISA). In this article, we bring these two technologies together and propose a Configurable Posit Enabled RISC-V Core called PERI. The article provides insights on how the Single-Precision Floating Point (“F”) extension of RISC-V can be leveraged to support posit arithmetic. We also present the implementation details of a parameterized and feature-complete posit Floating Point Unit (FPU). The configurability and the parameterization features of this unit generate optimal hardware, which caters to the accuracy and energy/area tradeoffs imposed by the applications, a feature not possible with IEEE-754 implementations. The posit FPU has been integrated with the RISC-V compliant SHAKTI C-class core as an execution unit. To further leverage the potential of posit, we enhance our posit FPU to support two different exponent sizes (with a posit size of 32 bits), thereby enabling multiple precisions at runtime. To enable the compilation and execution of C programs on PERI, we have made minimal modifications to the GNU C Compiler (GCC), targeting the “F” extension of RISC-V. We compare posit with IEEE-754 in terms of hardware area, application accuracy, and runtime. We also present an alternate methodology of integrating the posit FPU with the RISC-V core as an accelerator using the custom opcode space of RISC-V.
Devashree Tripathy, Amirali Abdolrashidi, Laxmi Narayan Bhuyan, Liang Zhou, Daniel Wong
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3451164

Abstract:
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have tried to use the locality behavior of regular and structured applications in thread scheduling, but the difficult case of irregular and unstructured parallel applications remains to be explored. We present PAVER, a Priority-Aware Vertex schedulER, which takes a graph-theoretic approach toward thread scheduling. We analyze the cache locality behavior among thread blocks (TBs) through just-in-time compilation, and represent the problem as a graph of the TBs and the locality among them. This graph is then partitioned into TB groups that display maximum data sharing, which are then assigned to the same streaming multiprocessor by the locality-aware TB scheduler. Through exhaustive simulation on Fermi, Pascal, and Volta architectures using a number of scheduling techniques, we show that PAVER reduces L2 accesses by 43.3%, 48.5%, and 40.21% and increases the average performance benefit by 29%, 49.1%, and 41.2% for the benchmarks with high inter-TB locality.
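A sketch of the scheduling idea (not PAVER's JIT-based analysis), using invented thread-block footprints: build a graph weighted by shared cache lines, group the most-sharing TBs greedily, and assign each group to one SM:

# Toy locality graph over thread blocks (TBs) and a greedy grouping pass.
from itertools import combinations

def locality_graph(tb_footprints):
    """tb_footprints: {tb_id: set of cache lines the TB touches}."""
    return {(a, b): len(tb_footprints[a] & tb_footprints[b])
            for a, b in combinations(tb_footprints, 2)}

def greedy_tb_groups(graph, tb_ids):
    groups, assigned = [], set()
    for (a, b), shared in sorted(graph.items(), key=lambda kv: -kv[1]):
        if shared == 0 or a in assigned or b in assigned:
            continue
        groups.append({a, b})                    # co-schedule the sharing pair
        assigned |= {a, b}
    groups += [{tb} for tb in tb_ids if tb not in assigned]   # TBs with no sharing
    return groups

footprints = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {10, 11}, 3: {11, 12}}
print(greedy_tb_groups(locality_graph(footprints), footprints))   # [{0, 1}, {2, 3}]
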
Nils Voss, Bastiaan Kwaadgras, Oskar Mencer, Wayne Luk, Georgi Gaydadjiev
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-28; https://doi.org/10.1145/3436995

Abstract:
We propose a design methodology to facilitate rigorous development of complex applications targeting reconfigurable hardware. Our methodology relies on analytical estimation of system performance and area utilisation for a given specific application and a particular system instance consisting of a control-flow machine working in conjunction with one or more reconfigurable dataflow accelerators. The targeted application is carefully analyzed, and the parts identified for hardware acceleration are reimplemented as a set of representative software models. Next, with the results of the application analysis, a suitable system architecture is devised and its performance is evaluated to determine bottlenecks, allowing predictable design. The architecture is iteratively refined until the final version satisfying the specification requirements in terms of performance and required hardware area is obtained. We validate the presented methodology using a widely accepted convolutional neural network (VGG-16) and an important HPC application (BQCD). In both cases, our methodology revealed and alleviated all system bottlenecks before the hardware implementation was started. As a result, the architectures were implemented right the first time, achieving state-of-the-art performance within 15% of our modelling estimations.
Pengyu Wang, Jing Wang, Chao Li, Jianzong Wang, Haojin Zhu, Minyi Guo
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3444844

Abstract:
Today’s GPU graph processing frameworks face scalability and efficiency issues as the graph size exceeds GPU-dedicated memory limit. Although recent GPUs can over-subscribe memory with Unified Memory (UM), they incur significant overhead when handling graph-structured data. In addition, many popular processing frameworks suffer sub-optimal efficiency due to heavy atomic operations when tracking the active vertices. This article presents Grus, a novel system framework that allows GPU graph processing to stay competitive with the ever-growing graph complexity. Grus improves space efficiency through a UM trimming scheme tailored to the data access behaviors of graph workloads. It also uses a lightweight frontier structure to further reduce atomic operations. With easy-to-use interface that abstracts the above details, Grus shows up to 6.4× average speedup over the state-of-the-art in-memory GPU graph processing framework. It allows one to process large graphs of 5.5 billion edges in seconds with a single GPU.
Muhammad Hassan, Chang Hyun Park, David Black-Schaffer
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-20; https://doi.org/10.1145/3446200

Abstract:
The SPEC CPU Benchmarks are used extensively for evaluating and comparing improvements to computer systems. This ubiquity makes characterization critical for researchers to understand the bottlenecks the benchmarks do and do not expose and where new designs should and should not be expected to show impact. However, in characterization there is a tradeoff between accuracy and reusability: The more precisely we characterize a benchmark’s performance on a given system, the less usable it is across different micro-architectures and varying memory configurations. For SPEC, most existing characterizations include system-specific effects (e.g., via performance counters) and/or only look at aggregate behavior (e.g., averages over the full application execution). While such approaches simplify characterization, they make it difficult to separate the applications’ intrinsic behavior from the system-specific effects and/or lose the diverse phase-based behaviors. In this work we focus on characterizing the applications’ intrinsic memory behavior by isolating it from micro-architectural configuration specifics. We do this by providing a simplified generic system model that evaluates the applications’ memory behavior across multiple cache sizes, with and without prefetching, and over time. The resulting characterization can be reused across a range of systems to understand application behavior and allow us to see how frequently different behaviors occur. We use this approach to compare the SPEC 2006 and 2017 suites, providing insight into their memory system behavior beyond previous system-specific and/or aggregate results. We demonstrate the ability to use this characterization in different contexts by showing a portion of the SPEC 2017 benchmark suite that could benefit from giga-scale caches, despite aggregate results indicating otherwise.
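One standard way to obtain such microarchitecture-independent miss ratios is LRU stack-distance analysis; the following is a minimal sketch under that assumption, not the authors' tooling:

# Compute LRU stack (reuse) distances from a cache-line trace and derive the
# miss ratio for any fully associative cache size -- independent of a specific
# microarchitecture.
def miss_ratio_curve(line_trace, cache_lines_options):
    stack, distances = [], []
    for line in line_trace:
        if line in stack:
            d = len(stack) - stack.index(line) - 1   # distinct lines since last use
            stack.remove(line)
        else:
            d = float("inf")                          # cold miss
        distances.append(d)
        stack.append(line)                            # move to MRU position
    return {c: sum(d >= c for d in distances) / len(distances)
            for c in cache_lines_options}

trace = [0, 1, 2, 0, 1, 2, 3, 4, 5, 0, 1, 2]          # toy cache-line trace
print(miss_ratio_curve(trace, cache_lines_options=[2, 4, 8]))   # {2: 1.0, 4: 0.75, 8: 0.5}
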
Anirudh Mohan Kaushik, Gennady Pekhimenko, Hiren Patel
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3439803

Abstract:
Data-dependent memory accesses (DDAs) pose an important challenge for high-performance graph analytics (GA). This is because such memory accesses do not exhibit enough temporal and spatial locality, resulting in low cache performance. Prior efforts that focused on improving the performance of DDAs for GA are not applicable across various GA frameworks. This is because (1) they only focus on one particular graph representation, and (2) they require workload changes to communicate specific information to the hardware for their effective operation. In this work, we propose a hardware-only solution to improving the performance of DDAs for GA across multiple GA frameworks. We present a hardware prefetcher for GA called Gretch, which addresses the above limitations. An important observation we make is that identifying certain DDAs without hardware-software communication is sensitive to the instruction scheduling. A key contribution of this work is a hardware mechanism that activates Gretch to identify DDAs when using either in-order or out-of-order instruction scheduling. Our evaluation shows that Gretch provides an average speedup of 38% over no prefetching, 25% over a conventional stride prefetcher, and outperforms prior DDA prefetchers by 22% with only a 1% increase in power consumption when executed on different GA workloads and frameworks.
Arnab Kumar Biswas
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-20; https://doi.org/10.1145/3443707

Abstract:
Program obfuscation is a widely used cryptographic software intellectual property (IP) protection technique against reverse engineering attacks in embedded systems. However, very few works have studied the impact of combining various obfuscation techniques on the obscurity (difficulty of reverse engineering) and performance (execution time) of obfuscated programs. In this article, we propose a Genetic Algorithm (GA)-based framework that not only optimizes obscurity and performance of obfuscated cryptographic programs, but it also ensures very low timing side-channel leakage. Our proposed Timing Side-Channel Sensitive Program Obfuscation Optimization Framework (TSC-SPOOF) determines the combination of obfuscation transformation functions that produce optimized obfuscated programs with preferred optimization parameters. In particular, TSC-SPOOF employs normalized compression distance (NCD) and channel capacity to measure obscurity and timing side-channel leakage, respectively. We also use RISC-V rocket core running on a Xilinx Zynq FPGA device as part of our framework to obtain realistic results. The experimental results clearly show that our proposed solution leads to cryptographic programs with lower execution time, higher obscurity, and lower timing side-channel leakage than unguided obfuscation.
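The normalized compression distance used here to score obscurity has a standard definition; a minimal sketch with zlib as the compressor C(.) and invented code snippets as inputs:

# NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
import zlib

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

original = b"int add(int a, int b) { return a + b; }"
obfuscated = b"int _0x3f(int _a, int _b) { int _t = _a; _t += _b; return _t; }"
print(round(ncd(original, obfuscated), 3))   # closer to 1 means more dissimilar
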
Maxime France-Pillois, Jérôme Martin, Frédéric Rousseau
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-22; https://doi.org/10.1145/3445030

Abstract:
Multi-core systems are now found in many electronic devices. But does current software design fully leverage their capabilities? The complexity of the hardware and software stacks in these platforms requires software optimization with end-to-end knowledge of the system. To optimize software performance, we must have accurate information about system behavior and time losses. Standard monitoring engines impose tradeoffs on profiling tools, making it impossible to reconcile all the expected requirements: accurate hardware views, fine-grain measurements, speed, and so on. Consequently, new approaches have to be examined. In this article, we propose a non-intrusive, accurate tool chain, which can reveal and quantify slowdowns in low-level software mechanisms. Based on emulation, this tool chain extracts behavioral information (time, contention) through hardware side channels, without distorting the software execution flow. The tool chain consists of two parts: (1) an online acquisition part that dumps hardware platform signals, and (2) an offline processing part that consolidates meaningful behavioral information from the dumped data. Using our tool chain, we studied and propose optimizations to MultiProcessor System on Chip (MPSoC) support in the Linux kernel, saving about 60% of the time required for the release phase of the GNU OpenMP synchronization barrier when running on a 64-core MPSoC.
Ramin Izadpanah, Christina Peterson, Yan Solihin, Damian Dechev
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3446391

Abstract:
Emerging byte-addressable Non-Volatile Memories (NVMs) enable persistent memory where process state can be recovered after crashes. To enable applications to rely on persistent data, durable data structures with failure-atomic operations have been proposed. However, they lack the ability to allow users to execute a sequence of operations as transactions. Meanwhile, persistent transactional memory (PTM) has been proposed by adding durability to Software Transactional Memory (STM). However, PTM suffers from high performance overheads and low scalability due to false aborts, logging, and ordering constraints on persistence. In this article, we propose PETRA, a new approach for constructing persistent transactional linked data structures. PETRA natively supports transactions, but unlike PTM, relies on the high-level information from the data structure semantics. This gives PETRA unique advantages in the form of high performance and high scalability. Our experimental results using various benchmarks demonstrate the scalability of PETRA in all workloads and transaction sizes. PETRA outperforms the state-of-the-art PTMs by an order of magnitude in transactions of size greater than one, and demonstrates superior performance in transactions of size one.
Nhut-Minh Ho, Himeshi De Silva, Weng-Fai Wong
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-24; https://doi.org/10.1145/3441830

Abstract:
This article presents GRAM (GPU-based Runtime Adaption for Mixed-precision), a framework for the effective use of mixed precision arithmetic for CUDA programs. Our method provides a fine-grain tradeoff between output error and performance. It can create many variants that satisfy different accuracy requirements by assigning different groups of threads to different precision levels adaptively at runtime. To widen the range of applications that can benefit from its approximation, GRAM comes with an optional half-precision approximate math library. Using GRAM, we can trade off precision for performance improvements of up to 540%, depending on the application and accuracy requirement.
Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, Holger Fröning
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3431731

Abstract:
Characterizing compute kernel execution behavior on GPUs for efficient task scheduling is a non-trivial task. We address this with a simple model enabling portable and fast predictions among different GPUs using only hardware-independent features. This model is built based on random forests using 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU, and SHOC. Evaluation of the model performance using cross-validation yields a median Mean Absolute Percentage Error (MAPE) of 8.86–52.0% for time and 1.84–2.94% for power prediction across five different GPUs, while latency for a single prediction varies between 15 and 108 ms.
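A hedged sketch of the modeling approach (a random forest over hardware-independent kernel features); the feature names, synthetic data, and train/test split below are assumptions, not the paper's dataset:

# Random-forest prediction of kernel execution time from portable features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical hardware-independent features per kernel:
# [instructions, global loads, global stores, branches, threads]
X = rng.integers(1, 1_000_000, size=(189, 5)).astype(float)
exec_time_ms = 5.0 + X[:, 0] * 2e-5 + X[:, 1] * 6e-5 + rng.normal(0, 1, 189)  # synthetic target

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:150], exec_time_ms[:150])
pred = model.predict(X[150:])
mape = np.mean(np.abs((pred - exec_time_ms[150:]) / exec_time_ms[150:])) * 100
print(f"MAPE on held-out kernels: {mape:.1f}%")
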
Kleovoulos Kalaitzidis, André Seznec
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-20; https://doi.org/10.1145/3436821

Abstract:
Value Prediction (VP) has recently been gaining interest in the research community, since prior work has established practical solutions for its implementation that provide meaningful performance gains. A constant challenge of contemporary context-based value predictors is to sufficiently capture value redundancy and exploit the predictable execution paths. To do so, modern context-based VP techniques tightly associate recurring values with instructions and contexts by building confidence upon them after a plethora of repetitions. However, when execution monotony exists in the form of intervals, the potential prediction coverage is limited, since prediction confidence is reset at the beginning of each new interval. In this study, we address this challenge by introducing the notion of Equality Prediction (EP), which represents the binary facet of VP. Following a twofold decision scheme (similar to branch prediction), at fetch time, EP makes use of control-flow history to predict equality between the last committed result for this instruction and the result of the currently fetched occurrence. When equality is predicted with high confidence, the last committed value is used. Our simulation results show that this technique obtains the same level of performance as previously proposed state-of-the-art context-based value predictors. However, by virtue of exploiting equality patterns that are not captured by previous VP schemes, our design can improve the speedup of standard VP by 19% on average, when combined with contemporary prediction models.
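A toy software model of the equality-prediction idea, simplified to a per-PC saturating confidence counter (the paper additionally folds control-flow history into the fetch-time decision):

# Predict that an instruction's result equals its last committed result once
# a confidence counter saturates; reset confidence on a mismatch.
class EqualityPredictor:
    def __init__(self, threshold=3, max_conf=7):
        self.last, self.conf = {}, {}
        self.threshold, self.max_conf = threshold, max_conf

    def predict(self, pc):
        """Return the predicted value, or None when confidence is too low."""
        if self.conf.get(pc, 0) >= self.threshold:
            return self.last.get(pc)
        return None

    def update(self, pc, result):
        equal = self.last.get(pc) == result
        c = self.conf.get(pc, 0)
        self.conf[pc] = min(c + 1, self.max_conf) if equal else 0
        self.last[pc] = result

ep = EqualityPredictor()
for value in [42, 42, 42, 42, 42, 7]:     # a monotone interval, then a change
    print(ep.predict(0x401000), end=" ")  # None None None None 42 42
    ep.update(0x401000, value)
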
Marcel Mettler, Daniel Mueller-Gritschneder, Ulf Schlichtmann
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3430699

Abstract:
Exhaustive verification techniques do not scale with the complexity of today’s multi-tile Multi-processor Systems-on-chip (MPSoCs). Hence, runtime verification (RV) has emerged as a complementary method, which verifies the correct behavior of applications executed on the MPSoC during runtime. In this article, we propose a decentralized monitoring architecture for large-scale multi-tile MPSoCs. In order to minimize performance and power overhead for RV, we propose a lightweight and non-intrusive hardware solution. It features a new specialized tracing interconnect that distributes and sorts detected events according to their timestamps. Each tile monitor has a consistent view on a globally sorted trace of events on which the behavior of the target application can be verified using logical and timing requirements. Furthermore, we propose an integer linear programming-based algorithm for the assignment of requirements to monitors to exploit the local resources best. The monitoring architecture is demonstrated for a four-tiled MPSoC with 20 cores implemented on a Virtex-7 field-programmable gate array (FPGA).
Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood, David Brooks
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-23; https://doi.org/10.1145/3431388

Abstract:
State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using a performance-optimal setting in feature-rich frameworks, however, involves a non-trivial amount of performance profiling efforts and often relies on domain-specific knowledge. This article takes a deep dive into analyzing the performance impact of key design features in a machine learning framework and quantifies the role of parallelism. The observations and insights distill into a simple set of guidelines that one can use to achieve much higher training and inference speedup. Across a diverse set of real-world deep learning models, the evaluation results show that the proposed performance tuning guidelines outperform the Intel and TensorFlow recommended settings by 1.30× and 1.38×, respectively.
Abhishek Singh, Shail Dave, Pantea Zardoshti, Robert Brotzman, Chao Zhang, Xiaochen Guo, Aviral Shrivastava, Gang Tan, Michael Spear
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3436730

Abstract:
General-purpose computing systems employ memory hierarchies to provide the appearance of a single large, fast, coherent memory. In special-purpose CPUs, programmers manually manage distinct, non-coherent scratchpad memories. In this article, we combine these mechanisms by adding a virtually addressed, set-associative scratchpad to a general purpose CPU. Our scratchpad exists alongside a traditional cache and is able to avoid many of the programming challenges associated with traditional scratchpads without sacrificing generality (e.g., virtualization). Furthermore, our design delivers increased security and improves performance, especially for workloads with high locality or that interact with nonvolatile memory.
Solomon Abera, M. Balakrishnan, Anshul Kumar
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3427092

Abstract:
Chip multiprocessors (CMPs) are ubiquitous in all computing systems ranging from high-end servers to mobile devices. In these systems, energy consumption is a critical design constraint as it constitutes the most significant operating cost for computing clouds. Analogous to this, longer battery life continues to be an essential user concern in mobile devices. To optimize power consumption, modern processors are designed with Dynamic Voltage and Frequency Scaling (DVFS) support at the individual core as well as the uncore level. This allows fine-grained control of performance and energy. For an n-core processor with m core and uncore frequency choices, the total DVFS configuration space is now m^(n+1) (with the uncore accounting for the +1). In addition, in CMPs, the performance-energy trade-off due to core/uncore frequency scaling for a single application cannot be determined independently, as cores share critical resources like the last-level cache (LLC) and the memory. Thus, unlike the uni-processor environment, the energy consumption of an application running on a CMP depends not only on its characteristics but also on those of its co-runners (applications running on other cores). The key objective of our work is to select a suitable core and uncore frequency that minimizes power consumption while limiting application performance degradation within certain pre-defined limits (which can be termed QoS requirements). The key contribution of our work is a learning-based model that is able to capture the interference due to shared cache, bus bandwidth, and memory bandwidth between applications running on multiple cores and predict near-optimal frequencies for core and uncore.
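The configuration-space arithmetic from the abstract, for an assumed 8-core processor with 10 frequency settings per domain:

# n per-core frequency domains plus one shared uncore domain, each with m
# settings, give m**(n + 1) candidate operating points -- far too many to
# search exhaustively online.
n_cores, m_freqs = 8, 10
print(f"{m_freqs ** (n_cores + 1):,} core/uncore DVFS configurations")  # 1,000,000,000
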
Atefeh Mehrabi, Aninda Manocha, Benjamin C. Lee, Daniel J. Sorin
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3427377

Abstract:
Accelerator design is expensive due to the effort required to understand an algorithm and optimize the design. Architects have embraced two technologies to reduce costs. High-level synthesis automatically generates hardware from code. Reconfigurable fabrics instantiate accelerators while avoiding fabrication costs for custom circuits. We further reduce design effort with statistical learning. We build an automated framework, called Prospector, that uses Bayesian techniques to optimize synthesis directives, reducing execution latency and resource usage in field-programmable gate arrays. We show that, within a given amount of tuning time, designs discovered by Prospector are closer to Pareto-efficient designs than those found by prior approaches. Prospector permits new studies for heterogeneous accelerators.
Paolo Sylos Labini, Marco Cianfriglia, Osvaldo Gervasi, Grigori Fursin, Anton Lokhmotov, Cedric Nugteren, Bruno Carpentieri, Fabiana Zollo, Flavio Vella
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-24; https://doi.org/10.1145/3434402

Abstract:
Efficient HPC libraries often expose multiple tunable parameters, algorithmic implementations, or a combination of them, to provide optimized routines. The optimal parameters and algorithmic choices may depend on input properties such as the shapes of the matrices involved in the operation. Traditionally, these parameters are manually tuned or set by auto-tuners. In emerging applications such as deep learning, this approach is not effective across the wide range of inputs and architectures used in practice. In this work, we analyze different machine learning techniques and predictive models to accelerate the convolution operator and GEMM. Moreover, we address the problem of dataset generation, and we study the performance, accuracy, and generalization ability of the models. Our insights allow us to improve the performance of computationally expensive deep learning primitives on high-end GPUs as well as low-power embedded GPU architectures on three different libraries. Experimental results show significant improvements in the target applications, from 50% up to 300%, compared to auto-tuned and highly optimized vendor-based heuristics, using simple decision-tree- and MLP-based models.
Minsu Kim, Jeong-Keun Park, Soo-Mook Moon
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-23; https://doi.org/10.1145/3427378

Abstract:
Test-pattern programs are used for testing DRAM memory chips. They run on a special embedded system called automated test equipment (ATE). Each ATE manufacturer provides its own programming language, which is mostly low level, thus accessing the registers in the ATE directly. The register structure of each ATE is quite different and highly irregular. Since DRAM chipmakers are often equipped with diverse ATEs from different manufacturers, they employ automatic translation of a program developed for one ATE to a program for different ATEs. This raises an irregular register allocation problem during translation. This article proposes a solution based on partitioned Boolean quadratic programming (PBQP). PBQP has been used for a number of compiler optimizations, including paired register allocation, which our ATE register allocation also requires. Moreover, the interleaved processing in ATE incurs complex register constraints, which we could also formulate elegantly with PBQP. The original PBQP solver is not quite appropriate to use, though, since ATE register allocation does not allow spills, so we devised a more elaborate PBQP solver that trades off the allocation time and allocation search space to find a solution in a reasonable amount of time. Our experimental results with product-level pattern programs show that the proposed register allocator successfully finds valid solutions in all cases, on the order of tenths of seconds.
Negin Nematollahi, Mohammad Sadrosadati, Hajar Falahati, Marzieh Barkhordar, Mario Paulo Drumond, Hamid Sarbazi-Azad, Babak Falsafi
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3429981

Abstract:
Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) core, which can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism to eliminate nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. The cycle-accurate simulation indicates an average performance improvement of 21.8% and power reduction of up to 18.3% for stencil codes in General-Purpose Graphics Processing Unit (GPGPU) standard benchmark suites. We show that NeDa’s performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange.
Syed M. A. H. Jafri, Hasan Hassan, Ahmed Hemani, Onur Mutlu
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-29; https://doi.org/10.1145/3417708

Abstract:
To employ a Convolutional Neural Network (CNN) in an energy-constrained embedded system, it is critical for the CNN implementation to be highly energy efficient. Many recent studies propose CNN accelerator architectures with custom computation units that try to improve the energy efficiency and performance of CNNs by minimizing data transfers from DRAM-based main memory. However, in these architectures, DRAM is still responsible for half of the overall energy consumption of the system, on average. A key factor of the high energy consumption of DRAM is the refresh overhead, which is estimated to consume 40% of the total DRAM energy. In this article, we propose a new mechanism, Refresh Triggered Computation (RTC), that exploits the memory access patterns of CNN applications to reduce the number of refresh operations. RTC uses two major techniques to mitigate the refresh overhead. First, Refresh Triggered Transfer (RTT) is based on our new observation that a CNN application accesses a large portion of the DRAM in a predictable and recurring manner. Thus, the read/write accesses of the application inherently refresh the DRAM, and therefore a significant fraction of refresh operations can be skipped. Second, Partial Array Auto-Refresh (PAAR) eliminates the refresh operations to DRAM regions that do not store any data. We propose three RTC designs (min-RTC, mid-RTC, and full-RTC), each of which requires a different level of aggressiveness in terms of customization to the DRAM subsystem. All of our designs have small overhead. Even the most aggressive RTC design (i.e., full-RTC) imposes an area overhead of only 0.18% in a 16 Gb DRAM chip and can have less overhead for denser chips. Our experimental evaluation on six well-known CNNs shows that RTC reduces average DRAM energy consumption by 24.4% and 61.3% for the least aggressive and the most aggressive RTC implementations, respectively. Besides CNNs, we also evaluate our RTC mechanism on three workloads from other domains. We show that RTC saves 31.9% and 16.9% DRAM energy for Face Recognition and Bayesian Confidence Propagation Neural Network (BCPNN), respectively. We believe RTC can be applied to other applications whose memory access patterns remain predictable for a sufficiently long time.
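A hypothetical model of the RTT observation (the retention window, row counts, and access times are invented): rows read or written within the retention window were implicitly refreshed and their scheduled refresh can be skipped, while rows holding no data need none at all (the PAAR case):

# Count refreshes skipped because a recent access already refreshed the row.
RETENTION_MS = 64

def refreshes_needed(rows_with_data, last_access_ms, now_ms):
    skipped, issued = 0, 0
    for row in rows_with_data:
        if now_ms - last_access_ms.get(row, -float("inf")) < RETENTION_MS:
            skipped += 1        # RTT: the access itself refreshed the row
        else:
            issued += 1         # refresh must still be issued
    return issued, skipped

last_access = {r: 60 for r in range(0, 800)}    # rows touched by the CNN layer
print(refreshes_needed(rows_with_data=range(1000), last_access_ms=last_access, now_ms=100))  # (200, 800)
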
Sujay Yadalam, Vinod Ganapathy, Arkaprava Basu
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3433983

Abstract:
Intel’s SGX architecture offers clients of public cloud computing platforms the ability to create hardware-protected enclaves whose contents are protected from privileged system software. However, SGX relies on system software for enclave memory management. In a sequence of recent papers, researchers have demonstrated that this reliance allows a malicious OS/hypervisor to snoop on the page addresses being accessed from within an enclave via various channels. This page address stream can then be used to infer secrets if the enclave’s page access pattern depends upon the secret and this constitutes an important class of side-channels. We propose SGXL, a hardware-software co-designed system that significantly increases the difficulty of any page-address-based side-channels through the use of large pages. A large page maps address ranges at a much larger granularity than the default page size (at least 512× larger). SGXL thus significantly lowers resolution of the leaked page address stream and could practically throttle all flavors of page-address-based side-channels. We detail the modifications needed to SGX’s software stack and the (minor) hardware enhancements required for SGXL to guarantee the use of large pages in the presence of adversarial system software. We empirically show that SGXL could be one of those rare systems that enhances security with the potential of improving performance as well.
Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Bharat Kaul, Gagandeep Goyal, Ramakrishna Upadrasta
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-27; https://doi.org/10.1145/3433103

Abstract:
Deep Neural Networks (DNNs) have revolutionized many aspects of our lives. The use of DNNs is becoming ubiquitous, including in software for image recognition, speech recognition, speech synthesis, and language translation, to name a few. The training of DNN architectures, however, is computationally expensive. Once the model is created, its use in the intended application (the inference task) is computationally heavy too, and the inference needs to be fast for real-time use. For obtaining high performance today, the norm is to use code for Deep Learning (DL) primitives that has been optimized for specific architectures by expert programmers and exposed via libraries. However, given the constant emergence of new DNN architectures, creating hand-optimized code is expensive, slow, and not scalable. To address this performance-productivity challenge, in this article we present compiler algorithms to automatically generate high-performance implementations of DL primitives that closely match the performance of hand-optimized libraries. We develop novel data reuse analysis algorithms using the polyhedral model to derive efficient execution schedules automatically. In addition, because most DL primitives use some variant of matrix multiplication at their core, we develop a flexible framework where it is possible to plug in library implementations of the same in lieu of a subset of the loops. We show that such a hybrid compiler plus a minimal library-use approach results in state-of-the-art performance. We develop compiler algorithms to also perform operator fusions that reduce data movement through the memory hierarchy of the computer system. Using Convolution Neural Network (CNN) models and matrix multiplication operations, we demonstrate that our approach automatically creates high-performing DNN building blocks whose performance matches the performance of hand-crafted kernels of Intel’s oneDNN library on high-end CPUs. At the same time, our techniques take only a fraction of the time (1/20 or less) compared to AutoTVM, a deep learning auto-tuner, to create optimized implementations.
Sooraj Puthoor, Mikko H. Lipasti
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-27; https://doi.org/10.1145/3428153

Abstract:
Sequential consistency (SC) is the most intuitive memory consistency model and the easiest for programmers and hardware designers to reason about. However, the strict memory ordering restrictions imposed by SC make it less attractive from a performance standpoint. Additionally, prior high-performance SC implementations required complex hardware structures to support speculation and recovery. In this article, we introduce the lockstep SC consistency model (LSC), a new memory model based on SC but carefully defined to accommodate the data parallel lockstep execution paradigm of GPUs. We also describe an efficient LSC implementation for an APU system-on-chip (SoC) and show that our implementation performs close to the baseline relaxed model. Evaluation of our implementation shows that the geometric mean performance cost for lockstep SC is just 0.76% for GPU execution and 6.11% for the entire APU SoC compared to a baseline with a weaker memory consistency model. Adoption of LSC in future APU and SoC designs will reduce the burden on programmers trying to write correct parallel programs, while also simplifying the implementation and verification of systems with heterogeneous processing elements and complex memory hierarchies.
Ari Rasch, Richard Schulze, Michel Steuwer, Sergei Gorlatch
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-26; https://doi.org/10.1145/3427093

Abstract:
Auto-tuning is a popular approach to program optimization: it automatically finds good configurations of a program’s so-called tuning parameters, whose values are crucial for achieving high performance on a particular parallel architecture and for particular characteristics of the input/output data. We present three new contributions to the Auto-Tuning Framework (ATF) that enable a key advantage in general-purpose auto-tuning: efficiently optimizing programs whose tuning parameters have interdependencies among them. We make the following contributions to the three main phases of general-purpose auto-tuning: (1) ATF generates the search space of interdependent tuning parameters with high performance by efficiently exploiting parameter constraints; (2) ATF stores such search spaces efficiently in memory, based on a novel chain-of-trees search space structure; (3) ATF explores these search spaces faster, by employing a multi-dimensional search strategy on its chain-of-trees search space representation. Our experiments demonstrate that, compared to state-of-the-art general-purpose auto-tuning frameworks, ATF substantially improves generating, storing, and exploring the search space of interdependent tuning parameters, thereby enabling an efficient overall auto-tuning process for important applications from popular domains, including stencil computations, linear algebra routines, quantum chemistry computations, and data mining algorithms.
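The toy sketch below (invented parameter names and constraint, not ATF’s actual API) illustrates what interdependent tuning parameters look like and why generating the space tree-wise, one parameter level at a time under its constraints, avoids materializing invalid configurations; this is the intuition behind the chain-of-trees structure.

    from itertools import product

    # Hypothetical tuning parameters with an interdependency:
    # the tile size must evenly divide the chosen block size.
    BLOCK_SIZES = [32, 64, 128]
    TILE_SIZES = list(range(1, 65))

    def naive_space():
        """Cross product first, filter later: many invalid points are materialized."""
        return [(b, t) for b, t in product(BLOCK_SIZES, TILE_SIZES) if b % t == 0]

    def constrained_space():
        """Generate tree-wise: only tiles valid for the chosen block are enumerated."""
        space = []
        for b in BLOCK_SIZES:                                    # first tree level
            for t in (t for t in TILE_SIZES if b % t == 0):      # constrained second level
                space.append((b, t))
        return space

    if __name__ == "__main__":
        assert sorted(naive_space()) == sorted(constrained_space())
        print(len(list(product(BLOCK_SIZES, TILE_SIZES))), "raw points vs",
              len(constrained_space()), "valid points")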
Wenjie Liu, Shoaib Akram, Jennifer B. Sartor, Lieven Eeckhout
ACM Transactions on Architecture and Code Optimization, Volume 18, pp 1-25; https://doi.org/10.1145/3431803

Abstract:
Emerging workloads in cloud and data center infrastructures demand high main memory bandwidth and capacity. Unfortunately, DRAM alone is unable to satisfy contemporary main memory demands. High-bandwidth memory (HBM) uses 3D die-stacking to deliver 4–8× higher bandwidth. HBM has two drawbacks: (1) capacity is low, and (2) soft error rate is high. Hybrid memory combines DRAM and HBM to promise low fault rates, high bandwidth, and high capacity. Prior OS approaches manage HBM by mapping pages to HBM versus DRAM based on hotness (access frequency) and risk (susceptibility to soft errors). Unfortunately, these approaches operate at a coarse-grained page granularity, and frequent page migrations hurt performance. This article proposes a new class of reliability-aware garbage collectors for hybrid HBM-DRAM systems that place hot and low-risk objects in HBM and the rest in DRAM. Our analysis of nine real-world Java workloads shows that: (1) newly allocated objects in the nursery are frequently written, making them both hot and low-risk, (2) a small fraction of the mature objects are hot and low-risk, and (3) allocation site is a good predictor for hotness and risk. We propose RiskRelief, a novel reliability-aware garbage collector that uses allocation site prediction to place hot and low-risk objects in HBM. Allocation sites are profiled offline and RiskRelief uses heuristics to classify allocation sites as DRAM and HBM. The proposed heuristics expose Pareto-optimal trade-offs between soft error rate (SER) and execution time. RiskRelief improves SER by 9× compared to an HBM-Only system while at the same time improving performance by 29% compared to a DRAM-Only system. Compared to a state-of-the-art OS approach for reliability-aware data placement, RiskRelief eliminates all page migration overheads, which substantially improves performance while delivering similar SER. Reliability-aware garbage collection opens up a new opportunity to manage emerging HBM-DRAM memories at fine granularity while requiring no extra hardware support and leaving the programming model unchanged.
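A minimal sketch of the allocation-site heuristic described above, with invented profile fields and thresholds: sites whose profiled objects are both frequently accessed (hot) and mostly written (low soft-error risk, since corrupted values are soon overwritten) are steered to HBM, and everything else to DRAM.

    from dataclasses import dataclass

    @dataclass
    class SiteProfile:
        site_id: str
        accesses_per_kb: float   # profiled hotness
        write_ratio: float       # fraction of accesses that are writes (proxy for low risk)

    def place_site(p: SiteProfile, hot_threshold=50.0, write_threshold=0.6) -> str:
        """Return 'HBM' for hot, low-risk allocation sites and 'DRAM' otherwise.
        Thresholds are illustrative, not the paper's values."""
        if p.accesses_per_kb >= hot_threshold and p.write_ratio >= write_threshold:
            return "HBM"
        return "DRAM"

    if __name__ == "__main__":
        profiles = [
            SiteProfile("Query.newBuffer", accesses_per_kb=220.0, write_ratio=0.9),
            SiteProfile("Cache.internString", accesses_per_kb=180.0, write_ratio=0.1),
            SiteProfile("Log.rareEvent", accesses_per_kb=3.0, write_ratio=0.8),
        ]
        for p in profiles:
            print(p.site_id, "->", place_site(p))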
Utpal Bora, Santanu Das, Pankaj Kukreja, Saurabh Joshi, Ramakrishna Upadrasta, Sanjay Rajopadhye
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-26; https://doi.org/10.1145/3418597

Abstract:
In the era of Exascale computing, writing efficient parallel programs is indispensable, yet at the same time writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in these programs are an important source of bugs. In this article, we propose LLOV, a fast, lightweight, language-agnostic, and static data race checker for OpenMP programs based on the LLVM compiler framework. We compare LLOV with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and F1 score of LLOV are comparable to those of other checkers while being orders of magnitude faster. To the best of our knowledge, LLOV is the only tool among the state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program to be data race free.
Dennis Pinto, Jose-María Arnau, Antonio González
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-19; https://doi.org/10.1145/3425604

Abstract:
Automatic Speech Recognition (ASR) has experienced a dramatic evolution since the pioneering development of Bell Labs’ single-digit recognizer more than 50 years ago. Current ASR systems have taken advantage of the tremendous improvements in AI during the past decade by incorporating Deep Neural Networks into the system and pushing their accuracy to levels comparable to that of humans. This article describes and characterizes a representative ASR system with state-of-the-art accuracy and proposes a hardware platform capable of decoding speech in real time with a power dissipation close to 1 Watt. The software is based on the so-called hybrid approach, with a vocabulary of 200K words and RNN-based language model re-scoring, whereas the hardware consists of a commercially available low-power processor along with two accelerators for the most compute-intensive tasks. The article shows that high performance can be obtained at very low power, enabling the deployment of these systems in extremely power-constrained environments such as mobile and IoT devices.
Anchu Rajendran, V. Krishna Nandivada
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-26; https://doi.org/10.1145/3414469

Abstract:
Graph algorithms are widely used in various applications, and their programmability and performance have garnered a lot of interest among researchers. Being able to run these graph analytics programs on distributed systems is an important requirement. Green-Marl is a popular Domain Specific Language (DSL) for coding graph algorithms and is known for its simplicity. However, the existing Green-Marl compiler for distributed systems (Green-Marl to Pregel) can only compile limited types of Green-Marl programs (those in Pregel canonical form). This severely restricts the types of parallel Green-Marl programs that can be executed on distributed systems. We present DisGCo, the first compiler to translate any general Green-Marl program to an equivalent MPI program that can run on distributed systems. Translating Green-Marl programs to MPI (SPMD/MPMD-style computation, distributed memory) presents many exciting challenges beyond the issues related to differences in syntax, as Green-Marl gives the programmer a unified view of the whole memory and allows parallel and serial code to be inter-mixed. We first present the set of challenges involved in translating Green-Marl programs to MPI and then present a systematic approach to do the translation. We also present a few optimization techniques to improve the performance of the generated programs. DisGCo is the first graph DSL compiler that can handle all syntactic capabilities of a practical graph DSL like Green-Marl and generate code that can run on distributed systems. Our preliminary evaluation shows that the generated programs are scalable. Further, compared to the state-of-the-art DH-Falcon compiler, which translates a subset of Falcon programs to MPI, our generated codes exhibit a geomean speedup of 17.32×.
Yu Zhang, Xiaofei Liao, Lin Gu, Hai Jin, Kan Hu, Haikun Liu, Bingsheng He
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-21; https://doi.org/10.1145/3416495

Abstract:
Recently, GPU-accelerated systems have been proposed for handling iterative graph algorithms. However, the parallelism of the GPU is still underutilized by existing GPU-based solutions. In fact, because of the power-law property of natural graphs, the paths between a small set of important vertices (e.g., high-degree vertices) play an outsized role in the convergence speed of iterative graph processing. Based on this fact, this article develops a novel system, called AsynGraph, to maximize data parallelism for faster iterative graph processing on GPUs. It first proposes an efficient structure-aware asynchronous processing scheme, which enables the state propagations of most vertices to be conducted concurrently on the GPU and achieves a higher GPU utilization ratio by efficiently handling the paths between the important vertices. Specifically, a graph sketch (consisting of the paths between the important vertices) is extracted from the original graph to serve as a fast bridge for most state propagations. By processing this sketch several times within each round of graph processing, more GPU parallelism can be exploited to accelerate most state propagations. In addition, a forward-backward intra-path processing scheme is adopted to asynchronously handle the vertices on each path, further boosting propagation along paths while also ensuring a smaller data access cost. In comparison with existing GPU-based systems, i.e., Gunrock, Groute, Tigr, and DiGraph, AsynGraph speeds up iterative graph processing by 3.06–11.52×, 2.47–5.40×, 2.23–9.65×, and 1.41–4.05×, respectively.
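As a rough CPU-side illustration of the structure-aware idea (not the AsynGraph implementation; GPU concerns are ignored and all names are invented), the sketch below extracts a sub-graph induced by high-degree vertices and runs extra relaxation passes over that sketch between full passes, so state propagates quickly along the important paths.

    import random

    def build_sketch(adj, top_fraction=0.1):
        """Induce a sub-graph on the highest-degree vertices (the 'important' hubs)."""
        by_degree = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
        hubs = set(by_degree[:max(1, int(len(adj) * top_fraction))])
        return {v: [u for u in adj[v] if u in hubs] for v in hubs}

    def relax(adj, dist):
        """One asynchronous-style relaxation pass over the given edges (unit weights)."""
        changed = False
        for v in adj:
            for u in adj[v]:
                if dist[v] + 1 < dist[u]:
                    dist[u] = dist[v] + 1
                    changed = True
        return changed

    def sketch_sssp(adj, source, sketch_rounds=3):
        """Single-source shortest paths, propagating more often along hub-to-hub paths."""
        dist = {v: float("inf") for v in adj}
        dist[source] = 0
        sketch = build_sketch(adj)
        while True:
            for _ in range(sketch_rounds):   # extra work on the hub sketch
                relax(sketch, dist)
            if not relax(adj, dist):         # one full pass; stop when nothing changes
                return dist

    if __name__ == "__main__":
        random.seed(0)
        n = 200
        adj = {v: random.sample(range(n), k=random.choice([1, 1, 2, 20])) for v in range(n)}
        print(sorted(sketch_sssp(adj, 0).values())[:10])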
Athanasios Stratikopoulos, Christos Kotselidis, John Goodacre, Mikel Luján
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-23; https://doi.org/10.1145/3423134

Abstract:
In this article, we present FastPath_MP, a novel low-overhead and energy-efficient storage multi-path architecture that leverages FPGAs to operate transparently to the main processor and improve the performance and energy efficiency of accessing storage devices. We prototyped FastPath_MP on both Arm-FPGA Zynq 7000 SoC and Zynq UltraScale+ MPSoC and evaluated its performance against standard microbenchmarks as well as the real-world in-memory Redis database. Our results show that FastPath_MP achieves up to 82% lower latency, up to 12× higher throughput, and up to 10× more energy efficiency against the baseline storage path of the Linux kernel.
Jhe-Yu Liou, Xiaodong Wang, Stephanie Forrest, Carole-Jean Wu
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-28; https://doi.org/10.1145/3418055

Abstract:
GPUs are a key enabler of the revolution in machine learning and high-performance computing, functioning as de facto co-processors to accelerate large-scale computation. As the programming stack and tool support have matured, GPUs have also become accessible to programmers, who may lack detailed knowledge of the underlying architecture and fail to fully leverage the GPU’s computation power. GEVO (Gpu optimization using EVOlutionary computation) is a tool for automatically discovering optimization opportunities and tuning the performance of GPU kernels in the LLVM representation. GEVO uses population-based search to find edits to GPU code compiled to LLVM-IR and improves performance on desired criteria while retaining required functionality. We demonstrate that GEVO improves the execution time of general-purpose GPU programs and machine learning (ML) models on NVIDIA Tesla P100. For the Rodinia benchmarks, GEVO improves GPU kernel runtime performance by an average of 49.48% and by as much as 412% over the fully compiler-optimized baseline. If kernel output accuracy is relaxed to tolerate up to 1% error, GEVO can find kernel variants that outperform the baseline by an average of 51.08%. For the ML workloads, GEVO achieves kernel performance improvement for SVM on the MNIST handwriting recognition (3.24×) and the a9a income prediction (2.93×) datasets with no loss of model accuracy. GEVO achieves 1.79× kernel performance improvement on image classification using ResNet18/CIFAR-10, with less than 1% model accuracy reduction.
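The loop below is a much-simplified, generic sketch of population-based search over candidate edit sequences (invented fitness, mutation, and crossover functions operating on plain integer lists rather than LLVM-IR), intended only to illustrate the search structure GEVO applies to GPU kernels; in the real tool, fitness would compile the edited IR, check outputs, and measure runtime.

    import random

    # Toy stand-ins: a "candidate" is a list of integer edit codes applied to a kernel.
    TARGET = [3, 1, 4, 1, 5, 9, 2, 6]   # pretend this edit sequence is the fastest variant

    def fitness(candidate):
        """Lower is better: here, distance to an optimum unknown to the search."""
        return sum(abs(c - t) for c, t in zip(candidate, TARGET))

    def mutate(candidate, rate=0.3):
        return [c + random.choice([-1, 0, 1]) if random.random() < rate else c
                for c in candidate]

    def crossover(a, b):
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    def evolve(pop_size=40, generations=200, genome_len=8, seed=1):
        random.seed(seed)
        pop = [[random.randint(0, 9) for _ in range(genome_len)] for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness)
            survivors = pop[: pop_size // 2]          # keep the fitter half
            children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                        for _ in range(pop_size - len(survivors))]
            pop = survivors + children
        return min(pop, key=fitness)

    if __name__ == "__main__":
        best = evolve()
        print("best edit sequence:", best, "fitness:", fitness(best))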
Gokul Subramanian Ravi, Joshua San Miguel, Mikko Lipasti
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-26; https://doi.org/10.1145/3412375

Abstract:
A key requirement for efficient general-purpose approximate computing is an amalgamation of flexible hardware design and intelligent application tuning, which together can leverage the appropriate amount of approximation that an application engenders and reap the best efficiency gains from it. To achieve this, we have identified three important features for building better general-purpose cross-layer approximation systems: ① individual per-operation (“spatio-temporally fine-grained”) approximation, ② hardware-cognizant application tuning for approximation, and ③ system-wide approximation synergy. We build an efficient general-purpose approximation system called SHASTA (Synergic HW-SW Architecture for Spatio-Temporal Approximation) to achieve these goals. First, in terms of hardware, SHASTA approximates both compute and memory: it proposes (a) a form of timing approximation called Slack-control Approximation, which controls the computation timing of each approximate operation, and (b) a Dynamic Pre-L1 Load Approximation mechanism that approximates loads prior to cache access. These hardware mechanisms are designed to achieve fine-grained, spatio-temporally diverse approximation. Next, SHASTA proposes a Hardware-cognizant Approximation Tuning mechanism to tune an application’s approximation for optimum execution efficiency under the prescribed error tolerance. The tuning mechanism is implemented atop a gradient descent algorithm, so the application’s approximation is tuned along the steepest error vs. execution efficiency gradient. Finally, SHASTA is designed with a full-system perspective, achieving synergic benefits across its optimizations and building a closer-to-ideal general-purpose approximation system. SHASTA is implemented on top of an OOO core and achieves mean speedups/energy savings of 20%–40% over a non-approximate baseline at greater than 90% accuracy; these benefits are substantial for applications executing on a traditional general-purpose processing system. SHASTA can be tuned to specific accuracy constraints and execution metrics and is quantitatively shown to achieve 2–15× higher benefits, in terms of performance and energy, compared to prior work.
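A bare-bones sketch of gradient-style tuning of per-operation approximation knobs (all functions, knobs, and numbers are invented): each step probes how much error and how much latency a small knob increase would add, then moves along the knob with the best latency-reduction-per-error ratio while total error stays under the tolerance.

    def tune_knobs(knobs, measure, error_budget, step=1, max_iters=100):
        """Greedy ascent along the steepest latency-vs-error direction.
        `measure(knobs)` returns (error, latency); both are assumed monotone in each knob."""
        for _ in range(max_iters):
            error, latency = measure(knobs)
            best_gain, best_idx = 0.0, None
            for i in range(len(knobs)):
                trial = list(knobs)
                trial[i] += step
                e2, l2 = measure(trial)
                if e2 <= error_budget and l2 < latency:
                    gain = (latency - l2) / max(e2 - error, 1e-9)
                    if gain > best_gain:
                        best_gain, best_idx = gain, i
            if best_idx is None:
                return knobs
            knobs[best_idx] += step
        return knobs

    if __name__ == "__main__":
        # Synthetic model: each knob adds 2% error but shrinks latency by a different amount.
        def measure(k):
            error = 0.02 * sum(k)
            latency = 100.0 - 8.0 * k[0] - 3.0 * k[1] - 1.0 * k[2]
            return error, latency

        print(tune_knobs([0, 0, 0], measure, error_budget=0.10))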
Rolando Brondolin, Marco D. Santambrogio
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-26; https://doi.org/10.1145/3418899

Abstract:
Microservices have changed cloud computing by moving applications’ complexity from one monolithic executable to thousands of network interactions between small components. Given the increasing deployment sizes, the architectural exploitation challenges, and the impact on data centers’ power consumption, we need to track this complexity efficiently. In this article, we propose a black-box monitoring approach to track microservices at scale, focusing on architectural metrics, power consumption, application performance, and network performance. The proposed approach is transparent with respect to the monitored applications, generates less overhead than the black-box approaches available in the state of the art, and provides fine-grained, accurate metrics.
George Christou, Giorgos Vasiliadis, Vassilis Papaefstathiou, Antonis Papadogiannakis, Sotiris Ioannidis
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-26; https://doi.org/10.1145/3419841

Abstract:
Instruction Set Randomization (ISR) protects against remote code injection attacks by randomizing the instruction set of each process. Thereby, even if an attacker succeeds in injecting code, it will fail to execute on the randomized processor. The majority of existing ISR implementations are based on emulators and binary instrumentation tools that unfortunately: (i) incur significant runtime performance overheads, (ii) limit the ease of deployment, (iii) cannot protect the underlying operating system kernel, and (iv) are vulnerable to evasion attempts that bypass the ISR protection itself. To address these issues, we present the design and implementation of ASIST, an architecture with both hardware and operating system support for ISR. ASIST uses our extended SPARC processor, mapped onto an FPGA board and running our modified Linux kernel to support the new features. In particular, before executing a new user-level process, the operating system loads its randomization key into a newly defined register, and the modified processor decodes the process’s instructions with this key. In addition, ASIST uses a separate randomization key for the operating system itself, to protect the base system against attacks that exploit kernel vulnerabilities to run arbitrary code with elevated privileges. Our evaluation shows that ASIST transparently protects both user-land applications and the operating system kernel from code injection and code reuse attacks, with about 1.5% runtime overhead when using simple encryption schemes such as XOR and Transposition. More secure ciphers such as AES, even though they are much harder to map to hardware, remain within acceptable margins, with approximately 10% runtime overhead when the spatial locality of code is exploited through modern instruction cache configurations.
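To illustrate the simplest of the encryption schemes mentioned (XOR), the sketch below randomizes a byte sequence standing in for an instruction stream with a per-process key; code injected without knowledge of the key decodes to garbage. This is a conceptual model in Python, not the ASIST SPARC datapath, and the instruction bytes are invented.

    import os

    def xor_randomize(code: bytes, key: bytes) -> bytes:
        """Encode (or, applied again, decode) an instruction stream with a repeating XOR key."""
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(code))

    if __name__ == "__main__":
        key = os.urandom(4)                                  # per-process randomization key
        program = bytes.fromhex("9de3bfa0" "01000000")       # stand-in instruction bytes
        stored = xor_randomize(program, key)                 # what resides in memory
        assert xor_randomize(stored, key) == program         # CPU holding the key decodes correctly
        injected = bytes.fromhex("90102001")                 # attacker payload, not encoded with the key
        print("decoded payload:", xor_randomize(injected, key).hex(), "(garbage)")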
Xinfeng Xie, Li Shuangchen, Peng Gu, Shuangchen Li, Yu Ji, Yuan Xie
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-25; https://doi.org/10.1145/3417709

Abstract:
The tremendous impact of deep learning algorithms over a wide range of application domains has encouraged a surge of neural network (NN) accelerator research. Facilitating NN accelerator design calls for guidance from an evolving benchmark suite that incorporates emerging NN models. Nevertheless, existing NN benchmarks are not suitable for guiding NN accelerator designs: they are either selected for general-purpose processors, without considering the unique characteristics of NN accelerators, or lack the quantitative analysis needed to guarantee their completeness during benchmark construction, update, and customization. In light of these shortcomings, we propose a novel benchmarking methodology for NN accelerators with a quantitative analysis of application performance features and a comprehensive awareness of software-hardware co-design. Specifically, we decouple the benchmarking process into three stages: First, we characterize the NN workloads with quantitative metrics and select representative applications for the benchmark suite to ensure diversity and completeness. Second, we refine the selected applications according to the customized model compression techniques provided by the specific software-hardware co-design. Finally, we evaluate a variety of accelerator designs on the generated benchmark suite. To demonstrate the effectiveness of our benchmarking methodology, we conduct a case study of composing an NN benchmark from the TensorFlow Model Zoo and compressing the selected models with various model compression techniques. We then evaluate the compressed models on various architectures, including GPU, Neurocube, DianNao, and Cambricon-X.
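A toy version of the first stage (invented feature values and a plain greedy farthest-point selection rather than the paper’s analysis): each workload is described by a few quantitative features, and a small subset is chosen so that every workload in the pool stays close to some selected representative.

    import math

    # Hypothetical per-workload features: (compute intensity, weight sparsity, activation reuse)
    WORKLOADS = {
        "resnet50":   (0.9, 0.2, 0.7),
        "mobilenet":  (0.4, 0.3, 0.5),
        "bert_base":  (0.8, 0.1, 0.2),
        "lstm_small": (0.3, 0.1, 0.1),
        "vgg16":      (0.95, 0.25, 0.75),
    }

    def coverage_radius(selected):
        """Worst-case distance from any workload to its nearest selected representative."""
        return max(min(math.dist(f, WORKLOADS[s]) for s in selected)
                   for f in WORKLOADS.values())

    def greedy_select(k):
        selected = [next(iter(WORKLOADS))]
        while len(selected) < k:
            # add the workload currently worst-covered by the selection
            farthest = max(WORKLOADS, key=lambda w: min(math.dist(WORKLOADS[w], WORKLOADS[s])
                                                        for s in selected))
            selected.append(farthest)
        return selected

    if __name__ == "__main__":
        reps = greedy_select(3)
        print("representatives:", reps, "coverage radius:", round(coverage_radius(reps), 3))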
Cristóbal Ramírez, César Alejandro Hernández, Oscar Palomar, Osman Unsal, Marco Antonio Ramírez, Adrián Cristal
ACM Transactions on Architecture and Code Optimization, Volume 17, pp 1-30; https://doi.org/10.1145/3422667

Abstract:
Vector architectures lack tools for research. Consider the gem5 simulator, which is possibly the leading platform for computer-system architecture research. Unfortunately, gem5 does not have an available distribution that includes a flexible and customizable vector architecture model. In consequence, researchers have to develop their own simulation platform to test their ideas, which consumes much research time. However, once the base simulation platform is developed, another question arises: which applications should be used to perform the experiments? The lack of vectorized benchmark suites is another limitation. To address these problems, this work presents a set of tools for designing and evaluating vector architectures. First, the gem5 simulator was extended to support the execution of RISC-V Vector instructions by adding a parameterizable Vector Architecture model that lets designers evaluate different approaches according to the target they pursue. Second, a novel Vectorized Benchmark Suite is presented: a collection of seven data-parallel applications from different domains that can be classified according to the modules they stress in the vector architecture. Finally, a study of the Vectorized Benchmark Suite executing on the gem5-based Vector Architecture model is presented. This suite is the first in its category to cover the different usage scenarios that may occur within different vector architecture designs, such as embedded systems, mainly focused on short vectors, or High-Performance Computing (HPC), usually designed for long vectors.