ASPDAC '21: 26th Asia and South Pacific Design Automation Conference

Conference Information
Name: ASPDAC '21: 26th Asia and South Pacific Design Automation Conference
Location: Tokyo, Japan

Latest articles from this conference

Yu Ma, Pingqiang Zhou
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431555

Abstract:
Speed and energy consumption are two important metrics in designing spiking neural networks (SNNs). The inference process of current SNNs is terminated after a preset number of time steps for all images, which leads to a waste of time and spikes. Instead, inference can be terminated after a proper number of time steps for each image. Besides, the normalization method also influences the time and spike consumption of SNNs. In this work, we first use a reinforcement learning algorithm to develop an efficient termination strategy that helps find the right number of time steps for each image. Then we propose a model tuning technique for memristor-based crossbar circuits to optimize the weights and biases of a given SNN. Experimental results show that the proposed techniques can reduce crossbar energy consumption by about 58.7%, reduce time consumption by over 62.5%, and double the drift lifetime of the memristor-based SNN.
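The termination policy itself is not detailed in the abstract; the following is a minimal, hypothetical sketch of per-image early termination, where a simple spike-count margin stands in for the reinforcement-learned stopping decision described above.

```python
# Minimal sketch (not the paper's algorithm): per-image early termination of
# SNN inference. A policy inspects the running output spike counts after each
# time step and decides whether to stop; a fixed margin threshold stands in
# for the reinforcement-learned policy.
import random

NUM_CLASSES = 10
MAX_STEPS = 100          # preset upper bound on time steps
MARGIN = 5               # hypothetical stopping margin (would be learned)

def simulate_step():
    """Stand-in for one SNN time step: returns output-layer spikes (0/1)."""
    return [1 if random.random() < 0.1 else 0 for _ in range(NUM_CLASSES)]

def infer_with_early_termination():
    counts = [0] * NUM_CLASSES
    for t in range(1, MAX_STEPS + 1):
        spikes = simulate_step()
        counts = [c + s for c, s in zip(counts, spikes)]
        top, second = sorted(counts, reverse=True)[:2]
        # Policy decision: stop once the leading class is clearly ahead.
        if top - second >= MARGIN:
            return counts.index(top), t
    return counts.index(max(counts)), MAX_STEPS

label, steps_used = infer_with_early_termination()
print(f"predicted class {label} after {steps_used} time steps")
```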
Peter Toth, Hiroki Ishikuro
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431644

Abstract:
This work presents a novel control loop concept to dynamically adjust the biasing of a differential ring oscillator (DRO) in order to improve its phase noise (PN) performance in the ultra-low-power domain. The proposed feedback system can be applied to any DRO with a tail current source. This paper presents the proposed concept and includes measurements of a 180 nm CMOS integrated prototype system, which underline the feasibility of the discussed idea. Measurements show up to 35 dBc/Hz phase noise improvement with the control loop active. Moreover, the tuning range of the implemented ring oscillator is extended by about 430% compared to fixed-bias operation. These values are measured at a minimum oscillation power consumption of 55 pW/Hz.
University LSI Design Contest ASP-DAC 2021
Uday Mallappa, Chung-Kuan Cheng
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431574

Abstract:
Static power consumption is a critical challenge for IC designs, particularly for mobile and IoT applications. A final post-layout step in modern design flows involves a leakage recovery step that is embedded in signoff static timing analysis tools. The goal of such recovery is to make use of the positive slack (if any) and recover leakage power by performing cell swaps with footprint-compatible variants. Although such swaps leave routing unaltered, the hard constraint is that they must not introduce any new timing violations. This process can require up to tens of hours of runtime, just before tapeout, when schedule and resource constraints are tightest. Physical design teams can therefore benefit greatly from a fast predictor of the leakage recovery step: if the eventual recovery will be too small, the entire step can be skipped and the resources allocated elsewhere. If we represent the circuit netlist as a graph with cells as vertices and nets connecting these cells as edges, the leakage recovery step is an optimization step on this graph. If we can learn these optimizations over several graphs with various logic-cone structures, we can generalize the learning to unseen graphs. Using graph convolutional neural networks, we develop a learning-based model that predicts per-cell recoverable slack and translates these slack values into equivalent power savings. For designs with up to 1.6M instances, our inference step takes less than 12 seconds on a Tesla P100 GPU, with additional feature extraction and post-processing steps consuming 420 seconds. The model is accurate, with relative error under 6.2%, in the design-specific context.
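As a rough illustration of the modeling idea (netlist as graph, cells as vertices, nets as edges), the sketch below runs one untrained graph-convolution layer with a per-node regression head; the feature set, layer sizes, and adjacency are hypothetical and not the authors' architecture.

```python
# Minimal sketch (assumed architecture, not the authors' model): one graph-
# convolution layer over a netlist graph (cells = nodes, nets = edges),
# followed by a per-node regression head predicting recoverable slack.
import numpy as np

rng = np.random.default_rng(0)
num_cells, feat_dim, hidden = 6, 4, 8

# Adjacency from net connectivity (hypothetical tiny netlist), plus self-loops.
A = np.array([[0,1,0,0,0,0],
              [1,0,1,1,0,0],
              [0,1,0,0,1,0],
              [0,1,0,0,1,0],
              [0,0,1,1,0,1],
              [0,0,0,0,1,0]], dtype=float) + np.eye(num_cells)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt           # symmetric normalization

X = rng.normal(size=(num_cells, feat_dim))    # per-cell features (e.g. slack, cell type)
W1 = rng.normal(size=(feat_dim, hidden))
w_out = rng.normal(size=(hidden, 1))

H = np.maximum(A_hat @ X @ W1, 0.0)           # graph convolution + ReLU
recoverable_slack = A_hat @ H @ w_out         # per-cell prediction (untrained)
print(recoverable_slack.ravel())
```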
Li-Cheng Zheng, Hao-Ju Chang, Yung-Chih Chen, Jing-Yang Jou
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431558

Abstract:
This paper introduces a method to enhance an integer linear programming (ILP)-based method for transforming a 1st-order threshold logic gate (1-TLG) into a 2nd-order TLG (2-TLG) with lower area cost. We observe that for a 2-TLG, most of the 2nd-order weights (2-weights) are zero. That is, in the ILP formulation, most of the variables for the 2-weights could be set to zero. Thus, we first propose three sufficient conditions for transforming a 1-TLG into a 2-TLG by extracting 2-weights. These extracted 2-weights are more likely to be non-zero. Then, we simplify the ILP formulation by eliminating the non-extracted 2-weights to speed up ILP solving. The experimental results show that, to transform a set of 1-TLGs into 2-TLGs, the enhanced method saves an average of 24% CPU time with only an average of 1.87% quality loss in terms of the area cost reduction rate.
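To make the kind of formulation involved concrete, here is a hedged sketch of a plausible ILP for 2-TLG synthesis using the PuLP package; the objective (sum of absolute weights plus the threshold), the weight bounds, and the target function (a 2-input XOR, which has no 1st-order realization) are all assumptions, not the paper's exact formulation. The paper's enhancement would additionally fix most 2-weight variables to zero before solving.

```python
# Minimal sketch (assumed formulation): find integer weights and threshold of a
# 2nd-order threshold logic gate (2-TLG) realizing a given Boolean function,
# minimizing an area proxy (sum of |weights| plus T). Requires PuLP
# (pip install pulp).
from itertools import product
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

def f(x1, x2):               # target function: 2-input XOR (not 1st-order realizable)
    return x1 ^ x2

prob = LpProblem("tlg2_synthesis", LpMinimize)
W = 8                                                # hypothetical weight bound
w1 = LpVariable("w1", -W, W, cat="Integer")
w2 = LpVariable("w2", -W, W, cat="Integer")
w12 = LpVariable("w12", -W, W, cat="Integer")        # 2nd-order weight
T = LpVariable("T", -W, W, cat="Integer")
absv = [LpVariable(f"a{i}", 0, W, cat="Integer") for i in range(3)]
for a, w in zip(absv, [w1, w2, w12]):
    prob += a >= w                                   # a = |w| via two inequalities
    prob += a >= -w
prob += lpSum(absv) + T                              # area-cost proxy objective

for x1, x2 in product([0, 1], repeat=2):
    s = w1 * x1 + w2 * x2 + w12 * (x1 * x2)
    if f(x1, x2):
        prob += s >= T                               # onset: gate must fire
    else:
        prob += s <= T - 1                           # offset: gate must not fire

prob.solve()
print([value(v) for v in (w1, w2, w12, T)])
```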
Hsuan Hsiao, Joshua San Miguel, Yuko Hara-Azumi, Jason Anderson
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431552

Abstract:
Stochastic computing (SC), with its probabilistic data representation format, has sparked renewed interest due to its ability to use very simple circuits to implement complex operations. Unlike traditional binary computing, however, SC must carefully handle correlations that exist across data values to avoid the risk of unacceptably inaccurate results. Since many SC circuits are designed to operate under the assumption that input values are independent, it is important to be able to accurately measure and characterize the independence of SC bitstreams. We propose zero correlation error (ZCE), a metric that quantifies how independent two finite-length bitstreams are, and show that it addresses fundamental limitations in metrics currently used by the SC community. Through evaluation at both the functional-unit level and the application level, we demonstrate how ZCE can be an effective tool for analyzing SC bitstreams, simulating circuits, and performing design space exploration.
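The ZCE metric itself is not defined in the abstract; the snippet below only illustrates the underlying notion of bitstream independence by comparing the observed joint ones-density of two finite-length bitstreams with the product expected under independence. It is an illustrative stand-in, not the paper's metric.

```python
# Illustrative sketch only (not ZCE): a generic independence check for two
# finite-length stochastic-computing bitstreams, comparing the observed
# overlap P(X=1, Y=1) against p_x * p_y expected under independence.
def overlap_deviation(xs, ys):
    assert len(xs) == len(ys) and xs
    n = len(xs)
    p_x = sum(xs) / n
    p_y = sum(ys) / n
    p_xy = sum(a & b for a, b in zip(xs, ys)) / n
    return p_xy - p_x * p_y        # 0 suggests independence-like behaviour

a = [1, 0, 1, 0, 1, 0, 1, 0]
print(overlap_deviation(a, a))                          # identical streams: strong positive deviation
print(overlap_deviation(a, [0, 1, 0, 1, 0, 1, 0, 1]))   # complementary streams: strong negative deviation
print(overlap_deviation(a, [1, 1, 0, 0, 1, 1, 0, 0]))   # deviation of zero: consistent with independence
```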
Kit Seng Tam, Chia-Chun Lin, Yung-Chih Chen, Chun-Yao Wang
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431550

Abstract:
Approximate computing is an emerging design paradigm for error-tolerant applications, e.g., signal processing and machine learning. In approximate computing, the area, delay, or power consumption of an approximate circuit can be improved by trading off its accuracy. In this paper, we propose an approximate logic synthesis approach based on a node-merging technique with an error rate guarantee. The ideas of our approach are to replace internal nodes with constant values and to merge pairs of functionally similar nodes in the circuit. We conduct experiments on a set of IWLS 2005 and MCNC benchmarks. The experimental results show that our approach can reduce area by up to 80%, and by 31% on average. Compared with the state-of-the-art method, our approach achieves a speedup of 51 under the same 5% error rate constraint.
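The toy sketch below illustrates the error-rate check behind constant replacement (one of the two ideas above) on a three-input circuit; the circuit, the candidate node, and the 5% bound are illustrative, and exhaustive simulation stands in for whatever error-rate computation the paper actually uses.

```python
# Minimal sketch (not the paper's algorithm): evaluate whether replacing an
# internal node by a constant keeps the circuit's error rate within a bound.
from itertools import product

def circuit(a, b, c, approx_n1=None):
    n1 = (a & b) if approx_n1 is None else approx_n1   # candidate internal node
    n2 = b | c
    return n1 ^ n2                                     # primary output

def error_rate(approx_n1):
    inputs = list(product([0, 1], repeat=3))
    errors = sum(circuit(a, b, c) != circuit(a, b, c, approx_n1)
                 for a, b, c in inputs)
    return errors / len(inputs)

for const in (0, 1):
    er = error_rate(const)
    verdict = "accept" if er <= 0.05 else "reject"
    print(f"replace n1 with constant {const}: error rate {er:.2%} -> {verdict}")
```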
Saru Vig, Siew-Kei Lam, Rohan Juneja
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431593

Abstract:
Memory integrity trees are widely used to protect external memories in embedded systems against bus attacks. However, existing methods often incur high performance overheads during memory authentication. To reduce memory accesses during authentication, the tree nodes are cached on-chip. In this paper, we propose a cache-aware technique to dynamically skew the integrity tree based on the application workload in order to reduce the performance overhead. The tree is initialized using a van Emde Boas (vEB) organization to take advantage of locality of reference. At run time, the nodes of the integrity tree are dynamically positioned based on their memory access patterns. In particular, frequently accessed nodes are placed closer to the root to reduce the memory access overheads. The proposed technique is compared with existing methods on Multi2Sim using benchmarks from SPEC-CPU2006, SPLASH-2 and PARSEC to demonstrate its performance benefits.
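As a software analogy of the skewing idea (not the authors' implementation), the sketch below counts accesses along a verification path and rotates a frequently accessed node one level closer to the root once a hypothetical threshold is reached, shortening its authentication path.

```python
# Minimal sketch (illustrative only): skewing an integrity tree at run time by
# rotating frequently verified nodes toward the root. The hit counters and the
# promotion threshold are stand-ins for the paper's access-pattern tracking.
class Node:
    def __init__(self, name, left=None, right=None):
        self.name, self.left, self.right = name, left, right
        self.hits = 0

PROMOTE_AFTER = 3    # hypothetical threshold

def rotate_up(parent, child):
    """Single rotation bringing a hot child one level closer to the root."""
    if parent.left is child:
        parent.left, child.right = child.right, parent
    else:
        parent.right, child.left = child.left, parent
    return child      # child becomes the new subtree root

def access(root, path):
    """Walk the verification path; promote a node once it becomes hot."""
    node, parent, grand = root, None, None
    for go_right in path:
        grand, parent = parent, node
        node = node.right if go_right else node.left
    node.hits += 1
    if parent is not None and node.hits >= PROMOTE_AFTER:
        new_sub = rotate_up(parent, node)
        if grand is None:
            return new_sub                  # promoted node is the new root
        if grand.left is parent:
            grand.left = new_sub
        else:
            grand.right = new_sub
    return root

root = Node("A", Node("B", Node("D"), Node("E")), Node("C"))
for _ in range(3):
    root = access(root, [False, True])      # repeatedly authenticate node E
print(root.name, root.left.name)            # E now sits directly under the root
```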
Omar Ragheb, Jason H. Anderson
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431556

Abstract:
The rising popularity of high-level synthesis (HLS) is due to the complexity and amount of background knowledge required to design hardware circuits. Despite significant recent advances in HLS research, HLS-generated circuits may be of lower quality than human-expert-designed circuits from the performance, power, or area perspectives. In this work, we aim to raise circuit performance by introducing a transactional memory (TM) synchronization model to the open-source LegUp HLS tool [1]. LegUp HLS supports the synthesis of multi-threaded software into parallel hardware [4], including support for mutual-exclusion lock-based synchronization. With the introduction of transactional-memory-based synchronization, location-specific (i.e., finer-grained) memory locks are made possible: instead of placing an access lock around an entire array, one can place a lock around individual array elements. Significant circuit performance improvements are observed through reduced contention-induced stalls and greater memory-access parallelism. On a set of 5 parallel benchmarks, the TM synchronization model improves wall-clock time by 2.0x, on average, over mutex-based locks.
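A software analogy of location-specific locking is sketched below: one lock per array element instead of one lock for the whole array, so threads touching different elements do not contend. It illustrates the granularity argument only; it is not LegUp's hardware TM mechanism.

```python
# Minimal sketch (software analogy, not the hardware implementation): per-
# element locks let threads that update different elements of a shared array
# proceed without contending on a single coarse lock.
import threading

histogram = [0] * 16
element_locks = [threading.Lock() for _ in histogram]   # fine-grained locks

def worker(samples):
    for idx in samples:
        with element_locks[idx]:        # lock only the element being updated
            histogram[idx] += 1

threads = [threading.Thread(target=worker, args=([i % 16] * 1000,))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(histogram))                   # 4000 updates, no lost increments
```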
Zhaojun Lu, Tanvir Arafin, Gang Qu
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431524

Abstract:
Processing in-memory (PIM) is an emerging technology poised to break the memory wall in the conventional von Neumann architecture. PIM reduces data movement from the memory system to the CPU by utilizing memory cells for logic computation. However, existing PIM designs do not support the high-precision computation (e.g., floating-point operations) essential for critical data-intensive applications. Furthermore, PIM architectures require complex control modules and costly peripheral circuits to harness the full potential of in-memory computation. These peripherals and control modules usually suffer from scalability and efficiency issues. Hence, in this paper, we explore the analog properties of the resistive random access memory (RRAM) crossbar and propose a scalable RRAM-based in-memory floating-point computation architecture (RIME). RIME uses single-cycle NOR, NAND, and Minority logic to achieve floating-point operations. RIME features a centralized control module and a simplified peripheral circuit to eliminate data movement during parallel computation. An experimental 32-bit RIME multiplier demonstrates a 4.8X speedup, 1.9X area improvement, and 5.4X better energy efficiency than state-of-the-art RRAM-based PIM multipliers.
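At the Boolean level, the universality of single-cycle Minority logic can be illustrated as follows: tying one input of a 3-input minority gate to a constant yields NAND or NOR, and its complement is the majority function used for a full adder's carry. The sketch below verifies those identities exhaustively; it says nothing about the RRAM circuit itself.

```python
# Boolean-level sketch only (not the RRAM circuit): a 3-input minority gate
# with one input tied to a constant yields NAND or NOR, and its complement is
# the majority function that produces a full adder's carry-out.
from itertools import product

def MIN3(a, b, c):
    return int(a + b + c <= 1)          # 1 when only a minority of inputs are 1

def NAND(a, b): return MIN3(a, b, 0)    # tie third input to 0
def NOR(a, b):  return MIN3(a, b, 1)    # tie third input to 1
def CARRY(a, b, cin): return 1 - MIN3(a, b, cin)   # majority = carry-out

for a, b, cin in product([0, 1], repeat=3):
    assert NAND(a, b) == 1 - (a & b)
    assert NOR(a, b) == 1 - (a | b)
    assert CARRY(a, b, cin) == int(a + b + cin >= 2)
print("minority-based NAND/NOR/carry verified on all input combinations")
```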
Lamija Hasanagić, Tin Vidović, Saad Mubeen, Mahammad Ashjaei, Matthias Becker
Proceedings of the 26th Asia and South Pacific Design Automation Conference; https://doi.org/10.1145/3394885.3431515

Abstract:
This paper addresses the scheduling of industrial time-critical applications on multi-core embedded systems. A novel technique under partitioned scheduling is proposed that minimizes inter-core data-propagation delays between tasks that are activated with different periods. The proposed technique is based on the read-execute-write model for task execution to guarantee temporal isolation when accessing shared resources. A Constraint Programming formulation is presented to find the schedule for each core. Evaluations are performed to assess the scalability as well as the resulting schedulability ratio, which is still 18% for two cores that are each utilized at 90%. Furthermore, an automotive industrial case study is performed to demonstrate the applicability of the proposed technique to industrial systems. The case study also presents a comparative evaluation of the schedules generated by (i) the proposed technique and (ii) the Rubus-ICE industrial tool suite with respect to jitter, inter-core data-propagation delays, and their impact on the data age of task chains that span multiple cores.
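For readers unfamiliar with such formulations, here is a deliberately tiny, hypothetical constraint-programming model (using Google OR-Tools CP-SAT, not the paper's solver or model) in which a producer's write phase must precede a consumer's read phase on one core, and the write-to-read data-propagation delay is minimized.

```python
# Minimal sketch (hypothetical model): schedule two non-preemptive task frames
# on one core under the read-execute-write model and minimize the producer-to-
# consumer data-propagation delay. Requires OR-Tools (pip install ortools).
from ortools.sat.python import cp_model

HYPERPERIOD = 20
DUR = {"producer": 4, "consumer": 3}          # whole read-execute-write frames

model = cp_model.CpModel()
start, interval = {}, {}
for name, dur in DUR.items():
    start[name] = model.NewIntVar(0, HYPERPERIOD - dur, f"start_{name}")
    interval[name] = model.NewIntervalVar(start[name], dur,
                                          start[name] + dur, f"iv_{name}")
model.AddNoOverlap(list(interval.values()))              # one core, no preemption
# Producer writes at the end of its frame; consumer reads at the start of its.
model.Add(start["producer"] + DUR["producer"] <= start["consumer"])
# Objective: minimize the write-to-read data-propagation delay.
delay = model.NewIntVar(0, HYPERPERIOD, "delay")
model.Add(delay == start["consumer"] - (start["producer"] + DUR["producer"]))
model.Minimize(delay)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for name in DUR:
        print(name, "starts at", solver.Value(start[name]))
    print("data-propagation delay:", solver.Value(delay))
```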