Alexandre Rodrigues, João Carlos Resende, Ricardo Chaves
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530998

Abstract:
Dedicated computational devices such as HSMs and FPGAs are frequently used to provide data security and privacy. However, these options have several drawbacks, particularly when considering IoT environments. HSMs offer high-grade services but are costly and lack application flexibility, while FPGAs, in general, are cheaper and adaptable but lack security services and protection. Herein, the SmartFusion2 SoC FPGA, a security-oriented system, is evaluated as a possible low-cost and flexible platform for security modules for the IoT. This work analyzes the security services of the SmartFusion2 SoC, their advantages, and possible trade-offs. To demonstrate the SoC's viability as a security module and/or a more adaptable HSM alternative, several case-study applications are considered and analyzed to elaborate on its potential, limitations, and possible mitigations.
Davide Gadioli, Emanuele Vitali, Federico Ficarelli, Chiara Latini, Candida Manelfi, Carmine Talarico, Cristina Silvano, Andrea R. Beccari, Gianluca Palermo
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530872

Abstract:
Virtual screening is one of the early stages of drug discovery and aims to select a set of promising ligands from a vast chemical library. Molecular docking, a crucial task in this process, consists of estimating the position of a molecule inside the docking site. In the context of urgent computing, we designed the EXSCALATE molecular docking platform from scratch to benefit from heterogeneous computation nodes and to avoid scaling issues. This poster presents the achievements and ongoing development of the EXSCALATE platform, together with an example of its use in the context of the COVID-19 pandemic.
Bingchao Li, Jizeng Wei
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530229

Abstract:
The on-chip memories of GPUs, including the register file, shared memory, and L1 cache, provide high-bandwidth, low-latency access for the temporary storage of data. The capacity of the L1 cache can be increased by using registers/shared memory that are unassigned to any warps/thread blocks, or released after warps/thread blocks finish, as cache lines. In this paper, we propose two techniques to manage requests to the on-chip memories and improve the efficiency of the L1 cache when registers and shared memory are leveraged as cache lines. Specifically, we develop a data-transferring policy, triggered when cache lines are recalled by the first register or shared-memory accesses of newly launched warps, that prevents the data locality from being destroyed. Additionally, we design a parallel issue scheme that exploits the parallel nature of an instruction's requests to the register file, shared memory, and L1 cache to decrease processing latency and hence increase instruction throughput. The experimental results demonstrate that our approach improves performance by 15% over prior work.
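As a purely illustrative sketch (not taken from the paper), the toy model below mimics the idea of borrowing idle registers/shared memory as extra L1 cache lines, together with a data-transferring policy that migrates those lines back into L1 when a newly launched warp reclaims the borrowed space; all class and method names are hypothetical and the storage model is heavily simplified.

```python
# Hypothetical toy model of "borrowed" register/shared-memory space serving
# as extra L1 cache lines, with a data-transfer policy that runs when a
# newly launched warp reclaims the borrowed space.

class BorrowedCacheLines:
    def __init__(self, l1_capacity):
        self.l1 = {}                 # tag -> data held in the real L1
        self.l1_capacity = l1_capacity
        self.borrowed = {}           # tag -> data held in idle registers/shared memory

    def fill(self, tag, data):
        """Place a line in L1 if possible, otherwise in borrowed storage."""
        if len(self.l1) < self.l1_capacity:
            self.l1[tag] = data
        else:
            self.borrowed[tag] = data

    def recall_borrowed(self):
        """Data-transferring policy: a new warp's first register/shared-memory
        access reclaims the borrowed space.  Instead of dropping the cached
        lines (and their locality), migrate as many as fit back into L1."""
        for tag in list(self.borrowed):
            if len(self.l1) < self.l1_capacity:
                self.l1[tag] = self.borrowed.pop(tag)
            else:
                self.borrowed.pop(tag)   # no room: the line is evicted

cache = BorrowedCacheLines(l1_capacity=2)
for i, tag in enumerate(["a", "b", "c", "d"]):
    cache.fill(tag, data=i)
cache.l1.pop("a")            # pretend a line was evicted, freeing L1 space
cache.recall_borrowed()      # migrate borrowed lines before the space is reclaimed
print(sorted(cache.l1))      # ['b', 'c']: locality of a borrowed line is preserved
```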
Gaurav Verma, Swetang Finviya, Abid M. Malik, Murali Emani, Barbara Chapman
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530251

Abstract:
Deep Neural Networks (DNNs) form the basis for many existing and emerging applications. Many DL compilers analyze the computation graph and apply various optimizations at different stages. These high-level optimizations are applied using compiler passes before the resultant computation graph is fed to low-level, hardware-specific optimizations. With advancements in DNN architectures and backend hardware, the search space of compiler optimizations has grown many-fold. Moreover, including passes without knowledge of the computation graph increases execution time while having only a slight influence on the intermediate representation. This paper presents preliminary results 1) summarizing the relevance of pass selection and ordering in a DL compiler, 2) neural architecture-aware selection of optimization passes, and 3) pruning the search space for the phase-selection problem in a DL compiler. We use TVM as the compiler and demonstrate experimental results on Nvidia A100 and GeForce RTX 2080 GPUs, establishing the relevance of neural architecture-aware selection of optimization passes for DNNs in DL compilers. Experimental evaluation with seven models, categorized into four architecturally different classes, demonstrated performance gains for most neural networks. For ResNets, the average throughput increased by 24% and 32% for the TensorFlow and PyTorch frameworks, respectively. Additionally, we observed an average 15% decrease in compilation time for ResNets, 45% for MobileNet, and 54% for SSD-based models without impacting throughput. BERT models showed a dramatic improvement, with a 92% reduction in compile time.
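The paper uses TVM, where individual Relay optimization passes can be force-included or disabled around a standard optimization level. The snippet below is a hedged illustration of that mechanism only: the specific passes are arbitrary examples rather than the authors' architecture-aware selection, and a stock ResNet-18 workload from tvm.relay.testing stands in for the evaluated models.

```python
# Hedged illustration of pass selection in TVM (not the paper's tool).
import tvm
from tvm import relay
from tvm.relay import testing

# A small ResNet workload stands in for one of the evaluated models.
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

# Enable/disable individual Relay passes around the standard opt_level-3 set.
with tvm.transform.PassContext(
        opt_level=3,
        required_pass=["FoldConstant"],      # force-include a pass
        disabled_pass=["AlterOpLayout"]):    # drop a pass judged unhelpful
    # target="llvm" keeps the example self-contained; the paper targets
    # Nvidia A100 and RTX 2080 GPUs (target="cuda").
    lib = relay.build(mod, target="llvm", params=params)
```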
Kunyu Zhou, Keni Qiu
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530238

Abstract:
As the Internet of Things (IoT) increasingly incorporates AI technology, it is a trend to deploy neural network algorithms at the edge and make IoT devices more intelligent than ever. Moreover, IoT devices based on energy-harvesting technology offer the advantages of a green economy, convenient maintenance, and a theoretically infinite lifetime. However, the harvested energy is often unstable, resulting in low performance because a fixed load cannot sufficiently utilize the harvested energy. To address this problem, recent works focusing on ReRAM-based convolutional neural network (CNN) accelerators under harvested energy have proposed hardware/software optimizations. However, those works have overlooked the mismatch between the power requirements of different CNN layers and the variation of the harvested power. Motivated by this observation, this paper proposes a novel strategy, called REC, that retimes the convolutional layers of CNN inferences to improve the performance and energy efficiency of energy-harvesting ReRAM-based accelerators. Specifically, at the offline stage, REC defines different power levels to match the power requirements of different convolutional layers. At runtime, instead of sequentially executing the convolutional layers of an inference one by one, REC retimes the execution timeframe of different convolutional layers so as to accommodate different CNN layers to the changing power input. What is more, REC provides a parallel strategy to fully utilize very high power income. Our experimental results show that the proposed REC approach achieves an average performance improvement of 6.1x (up to 16.5x) compared to the traditional strategy without the REC idea.
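As a rough, hypothetical sketch of the retiming idea (not the paper's algorithm), the snippet below keeps several inferences in flight and, at each timeframe, runs the next layer of whichever inference best matches the currently harvested power; the per-layer power levels and the power trace are invented numbers.

```python
# Toy retiming sketch: match pending convolutional layers to the
# currently harvested power instead of executing layers strictly in order.

LAYER_POWER = [3, 1, 2, 1]          # per-layer power requirement of the CNN (invented)

def pick_layer(progress, power_in):
    """Among unfinished inferences, pick the one whose next layer's power
    requirement is highest while still within the harvested power."""
    candidates = [(LAYER_POWER[p], i) for i, p in enumerate(progress)
                  if p < len(LAYER_POWER) and LAYER_POWER[p] <= power_in]
    return max(candidates)[1] if candidates else None

progress = [0, 0, 0]                # layer index reached by each in-flight inference
for power_in in [1, 3, 2, 1, 3, 2, 3, 1, 2, 3, 1, 2]:   # harvested-power trace (invented)
    chosen = pick_layer(progress, power_in)
    if chosen is None:
        continue                    # income too low for any pending layer
    progress[chosen] += 1           # execute that layer in this timeframe
print(progress)                     # how far each inference has advanced, e.g. [2, 4, 4]
```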
Alvee Noor, Kenneth Kent, Kazuhiro Konno, Daryl Maier
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530233

Abstract:
Just-in-time (JIT) compilers achieve application portability and improved management of large code bases by abstracting architecture-specific details from programmers. Eclipse OMR and Eclipse OpenJ9 invest extensively in JIT technology to efficiently execute architecture-neutral Java bytecode. OMR is a robust language-runtime builder, and OpenJ9 is a managed language runtime that consumes OMR. The targets of OMR and OpenJ9 include AArch64, the 64-bit version of the ARM architecture. AArch64 is popular in the embedded computing market, where computing infrastructure resources (e.g., CPU, memory) are constrained. SIMD (Single Instruction, Multiple Data) instructions primarily evolved to accelerate multimedia applications such as motion video, real-time physics, and graphics, which involve repetitive operations on large arrays of numbers. This paper discusses the steps taken to add SIMD support to OMR for AArch64. The implementation of Advanced SIMD and floating-point instructions is also discussed, covering vectorized mathematical operations, including addition, subtraction, multiplication, and division for the supported data types. We validate our implementation through the relevant OMR tril tests, present two microbenchmarks, VectorizationMicrobenchmark and Sepia Tone Filter, and a set of standard benchmarks, which leverage the OpenJ9 autovectorization process on AArch64. The AArch64 vectorized operations are evaluated against similar non-vectorized operations using Eclipse OpenJ9. We demonstrate an improvement of up to four times in the execution speed of certain vector arithmetic operations.
Na Lin, Hongzhi Qin, Junling Shi, Liang Zhao
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530239

Abstract:
Device-to-device (D2D) content caching is a promising technology to mitigate backhaul pressure and reduce content transmission delay. In this paper, to improve the content hit rate (CHR) and the utilization efficiency of the limited caching capacity, we put forward a caching content placement strategy that predicts user preference and content popularity, where unmanned aerial vehicles (UAVs) are introduced into the D2D networks to provide computation-offloading services to the users. A dynamic resource allocation optimization algorithm (DRAOA) is proposed to deploy UAVs and plan UAV trajectories adaptively according to the users' task requirements. Simulation results show that the proposed caching content placement policy outperforms the existing baselines. Additionally, the DRAOA effectively improves the network capacity and mitigates the computation delay compared to two other DRL algorithms.
Mohsen Seyedkazemi Ardebili, Andrea Bartolini, Luca Benini
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530864

Abstract:
Modern scientific discoveries are driven by an insatiable demand for computational resources. To solve large problems in science, engineering, and business, data centers provide High-Performance Computing (HPC) systems that aggregate the computing capacity of thousands of computing nodes. Anomaly prediction is critical in order to preserve the continuity of service of HPC systems and prevent hardware deterioration. In the data center, a thermal anomaly occurs when the balance between cooling capacity and computational demand is disturbed; moreover, it is identifiable from a suspicious/abnormal pattern in the monitoring signals. In this poster, the anomaly prediction task in HPC systems is investigated by defining complex statistical rule-based and Deep Learning (DL)-based anomaly detection methods, and then utilizing these anomaly detection methods in an anomaly prediction framework.
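A minimal, assumed example of what a statistical rule-based detector for such thermal monitoring signals could look like (not the poster's actual rules) is a rolling three-sigma check on a node temperature trace:

```python
# Assumed, simplified rule: flag a thermal anomaly when a node temperature
# drifts more than three standard deviations from its recent rolling mean.
import numpy as np

def rule_based_anomalies(temps, window=60, k=3.0):
    """Return indices where the signal violates a rolling k-sigma rule."""
    temps = np.asarray(temps, dtype=float)
    flags = []
    for t in range(window, len(temps)):
        recent = temps[t - window:t]
        mu, sigma = recent.mean(), recent.std()
        if sigma > 0 and abs(temps[t] - mu) > k * sigma:
            flags.append(t)
    return flags

# Synthetic trace: stable around 55 C, then a sudden cooling imbalance.
trace = [55.0 + 0.2 * np.sin(i / 5) for i in range(200)] + [70.0] * 5
print(rule_based_anomalies(trace))   # indices of the injected anomaly
```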
Haizhou Du, Sheng Huang, Qiao Xiang
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530246

Abstract:
The synchronized Local-SGD (stochastic gradient descent) strategy has become popular in distributed machine learning (DML) since it can effectively reduce the frequency of model communication and ensure global model convergence. However, it does not work well and leads to excessive training time in heterogeneous environments due to differences in worker performance. In particular, in scenarios with unbalanced data, these differences between workers may aggravate low resource utilization and eventually lead to stragglers, which seriously hurt the whole training procedure. Existing solutions either suffer from the heterogeneity of computing resources or do not fully address the environment's dynamics. In this paper, we eliminate the negative impact of dynamic resource constraints in heterogeneous DML environments with a novel, adaptive load-balancing framework called Orchestra. The main idea of Orchestra is to improve resource utilization by balancing the load across workers according to their performance and the imbalance in data volume. Additionally, one of Orchestra's strongest features is its per-worker adaptation of the number of local updates at each epoch. To achieve this improvement, we propose a distributed deep reinforcement learning-driven algorithm that allows each worker to dynamically determine its number of local updates and its training data volume, subject to mini-batch time and resource constraints at each epoch. Our design significantly improves the convergence speed of the model in DML compared with other state-of-the-art approaches.
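A hedged sketch of the underlying idea, not Orchestra's DRL policy: a simple heuristic that assigns each worker a number of local updates inversely proportional to its measured mini-batch time, so that all workers finish a synchronization round at roughly the same moment.

```python
# Heuristic stand-in for adaptive local-update assignment in Local-SGD:
# slower workers get fewer local updates per round, faster workers get more,
# which reduces waiting on stragglers at each synchronization barrier.

def assign_local_updates(batch_times, round_budget):
    """batch_times: measured seconds per mini-batch for each worker.
    round_budget: wall-clock seconds allowed per synchronization round."""
    return [max(1, int(round_budget / t)) for t in batch_times]

batch_times = [0.8, 1.0, 2.5]        # a fast, a medium, and a straggler worker
print(assign_local_updates(batch_times, round_budget=10.0))   # [12, 10, 4]
```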
Murat Isik, Ankita Paul, M. Lakshmi Varshika, Anup Das
Proceedings of the 19th ACM International Conference on Computing Frontiers; https://doi.org/10.1145/3528416.3530232

Abstract:
We propose a design methodology to facilitate fault tolerance in deep learning models. First, we implement a many-core fault-tolerant neuromorphic hardware design, where the neuron and synapse circuitry in each neuromorphic core is enclosed with astrocyte circuitry; astrocytes, the star-shaped glial cells of the brain, facilitate self-repair by restoring the spike-firing frequency of a failed neuron using a closed-loop retrograde feedback signal. Next, we introduce astrocytes into a deep learning model to achieve the required degree of tolerance to hardware faults. Finally, we use system software to partition the astrocyte-enabled model into clusters and implement them on the proposed fault-tolerant neuromorphic design. We evaluate this design methodology using seven deep learning inference models and show that it is both area- and power-efficient.
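Purely as an illustrative sketch (with invented gains and rates, not the paper's circuitry), the closed-loop self-repair idea can be pictured as an astrocyte controller that senses the drop in a neuron group's firing rate after a fault and boosts the surviving neurons' drive via a retrograde feedback signal until the target rate is restored:

```python
# Illustrative closed-loop self-repair: after one neuron fails, a simple
# proportional feedback signal raises the surviving neurons' firing rates
# until the group's average rate returns to the target.

TARGET_RATE = 100.0                  # desired average spikes/s for the group (invented)

def astrocyte_repair(rates, failed, gain=0.2, steps=50):
    rates = list(rates)
    rates[failed] = 0.0              # the faulty neuron stops firing
    for _ in range(steps):
        group_rate = sum(rates) / len(rates)
        feedback = gain * (TARGET_RATE - group_rate)   # retrograde feedback signal
        rates = [0.0 if i == failed else r + feedback
                 for i, r in enumerate(rates)]
    return [round(r, 1) for r in rates]

print(astrocyte_repair([100.0, 100.0, 100.0, 100.0], failed=2))
# surviving neurons fire faster (~133.3 spikes/s) so the group rate returns to ~100
```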