#### IEEE Transactions on Parallel and Distributed Systems

Journal Information
ISSN / EISSN: 10459219 / 15582183
Total articles ≅ 5,028

#### Latest articles in this journal

Published: 29 November 2022
IEEE Transactions on Parallel and Distributed Systems, pp 1-14; https://doi.org/10.1109/tpds.2022.3225274

Abstract:
The wide use of internet-connected services makes massive personal data collected by service providers without the need of our consent. Although the archived data may enable them to provide better service experiences for users, it also presents serious risks to individual privacy, especially when active or unexpected data breaches have become commonplace. To mitigate this issue, several acts and regulations (e.g., the European Union general data protection regulation) have been issued and specified a lot of security requirements for personal data management. Among these various requirements, we mainly focus on the requirement of giving back the access control of personal data to data owners themselves and the right to be forgotten for data erasure. In this paper, we provide a cryptographic solution of achieving these two requirements in the setting of outsourced storage. Specifically, we introduce a personal data management framework built upon a novel cryptographic primitive dubbed as forward-secure attribute-based puncturable encryption (FS-DABPE). This primitive simultaneously features of system-wide forward secrecy and practical key management as well as fine-grained access control of the encrypted personal data. Consequently, by locally puncturing, updating and erasing system-wide secret keys, it securely realizes fine-grained personal data sharing and data erasure without interactions. Furthermore, to instantiate the proposed framework, we present a concrete FS-DABPE construction, and prove its security under a well-studied complexity assumption. In addition, we provide a prototype implementation of the concrete construction, and present extensive experimental results that illustrate its feasibility and practicability.
Marcin Rogowski, Samar A. Aseeri, ,
Published: 29 November 2022
IEEE Transactions on Parallel and Distributed Systems, pp 1-12; https://doi.org/10.1109/tpds.2022.3225481

Abstract:
We present mpi4py.futures, a lightweight, asynchronous task execution framework targeting the Python programming language and using the Message Passing Interface (MPI) for interprocess communication. mpi4py.futures follows the interface of the concurrent.futures package from the Python standard library and can be used as its drop-in replacement, while allowing applications to scale over multiple compute nodes. We discuss the design, implementation, and feature set of mpi4py.futures and compare its performance to other solutions on both shared and distributed memory architectures. On a shared-memory system, we show mpi4py.futures to consistently outperform Python's concurrent.futures with speedup ratios between 1.4X and 3.7X in throughput (tasks per second) and between 1.9X and 2.9X in bandwidth. On a Cray XC40 system, we compare mpi4py.futures to Dask – a well-known Python parallel computing package. Although we note more varied results, we show mpi4py.futures to outperform Dask in most scenarios.
, Gelin Fu, Chenyang Li, Zhongpei Luo, Lucheng Zhang, Ruiyang Chen, Wenzhe Zhao, Nanning Zheng,
Published: 28 November 2022
IEEE Transactions on Parallel and Distributed Systems, pp 1-15; https://doi.org/10.1109/tpds.2022.3225230

Abstract:
Sparse Matrix-Vector Multiplication (SpMV) is important in scientific and industrial applications and remains a well-known challenge for modern CPUs due to high sparsity and irregularity. Many researchers try to improve SpMV performance by designing dedicated data formats and computation patterns. However, current out-of-order superscalar CPUs usually have complex micro-architectures where exist complicated interactions and restrictions among software and hardware factors. It is hard to systematically study the effect of optimization methods on the overall performance, as its benefits may be undermined by other ignored factors. In this paper, we thoroughly study the execution of SpMV on modern CPUs and propose a comprehensive performance model to reveal the critical factors and their relationships. Specifically, we first closely study the coding characteristics of SpMV kernels to identify key factors worthy of attention. Then we model the execution of SpMV as two overlapped parts: CPU pipeline and memory latency. Both are carefully modeled with related hardware and software factors. Our model also analyzes SIMD performance by considering the usage of specific SIMD instructions and vector registers. Experiments show that our model matches the actual kernel execution on real-world commercial processors. More importantly, our model provides valuable insights into critical bottlenecks and factors of SpMV. Guided by the performance model, we propose SpV8, a novel SpMV kernel that focuses to optimize several critical factors and greatly improves computation efficiency and memory bandwidth. Experiments on commercial Intel/AMD x86 and ARM AArch64 platforms show that SpV8 outperforms several state-of-the-art approaches with large margins, achieving an average $3.4\times$ over Intel Math Kernel Library and $1.4\times$ over the best existing approach. Such results indicate that the proposed model is capable of valuable guidance for efficient SpMV optimizations
, Dongshan Bai, Dezhong Yao, Yutong Dai, , , Lichao Sun
Published: 28 November 2022
IEEE Transactions on Parallel and Distributed Systems, pp 1-13; https://doi.org/10.1109/tpds.2022.3225185

Abstract:
Federated Learning (FL) is an emerging approach in edge computing for collaboratively training machine learning models among multiple devices, which aims to address limited bandwidth, system heterogeneity, and privacy issues in traditional centralized training. However, the existing federated learning methods focus on learning a shared global model for all devices, which may not always be ideal for different devices. Such situations become even worse when each edge device has its own data distribution or task. In this paper, we study personalized federated learning in which our goal is to train models to perform well for individual clients. We observe that the initialization in each communication round causes the forgetting of historical personalized knowledge. Based on this observation, we propose a novel Personalized Federated Learning (PFL) framework via self-knowledge distillation, named pFedSD. By allowing clients to distill the knowledge of previous personalized models to current local models, pFedSD accelerates the process of recalling the personalized knowledge for the latest initialized clients. Moreover, self-knowledge distillation provides different views of data in feature space to realize an implicit ensemble of local models. Extensive experiments on various datasets and settings demonstrate the effectiveness and robustness of pFedSD.
, , Beichen Ma, Chendi Zhou, Jingbo Zhou, Yang Zhou, , Dejing Dou
Published: 28 November 2022
IEEE Transactions on Parallel and Distributed Systems, pp 1-17; https://doi.org/10.1109/tpds.2022.3224941

Abstract:
Recent years have witnessed a large amount of decentralized data in various (edge) devices of end-users, while the decentralized data aggregation remains complicated for machine learning jobs because of regulations and laws. As a practical approach to handling decentralized data, Federated Learning (FL) enables collaborative global machine learning model training without sharing sensitive raw data. The servers schedule devices to jobs within the training process of FL. In contrast, device scheduling with multiple jobs in FL remains a critical and open problem. In this paper, we propose a novel multi-job FL framework, which enables the training process of multiple jobs in parallel. The multi-job FL framework is composed of a system model and a scheduling method. The system model enables a parallel training process of multiple jobs, with a cost model based on the data fairness and the training time of diverse devices during the parallel training process. We propose a novel intelligent scheduling approach based on multiple scheduling methods, including an original reinforcement learning-based scheduling method and an original Bayesian optimization-based scheduling method, which corresponds to a small cost while scheduling devices to multiple jobs. We conduct extensive experimentation with diverse jobs and datasets. The experimental results reveal that our proposed approaches significantly outperform baseline approaches in terms of training time (up to 12.73 times faster) and accuracy (up to 46.4% higher).
Mingyu Liu, ,
Published: 25 November 2022
Ieee Transactions on Parallel and Distributed Systems, pp 1-17; https://doi.org/10.1109/tpds.2022.3224865

Abstract:
The cloud storage boom has prompted providers to offer two storage tiers, i.e., hot and cold tiers, which are respectively purpose-built to provide the lowest cost for frequent and infrequent access patterns. However, for cloud users, it is non-trivial to determine cost-effective tiers because it is hard to obtain future access patterns in advance and is difficult to predict them exactly. The lack of future information poses a risk of increasing costs instead of saving costs. This is not the only challenge encountered when it comes to cost optimization. In this paper, we take Amazon S3 as an example to analyze the pricing of two-tier cloud storage and derive several major challenges faced by cost optimization. Then, assuming a priori knowledge of future access patterns, we propose an optimal offline algorithm based on dynamic programming to determine cost-effective tiers for each time slot. Further, to handle online workload arrivals, we formulate the problem using Markov decision processes and propose RLTiering based on deep reinforcement learning. Eventually, the cost performance of RLTiering is evaluated based on real-world traces and prevalent Amazon S3 pricing, and the results show that it achieves significant cost-savings.
, , , Gerhard Wellein
Published: 21 November 2022
IEEE Transactions on Parallel and Distributed Systems, pp 1-18; https://doi.org/10.1109/tpds.2022.3223512

Abstract:
The multiplication of a sparse matrix with a dense vector (SpMV) is a key component in many numerical schemes and its performance is known to be severely limited by main memory access. Several numerical schemes require the multiplication of a sparse matrix polynomial with a dense vector which is typically implemented as a sequence of SpMVs. This results in low performance and ignores the potential to increase the arithmetic intensity by reusing the matrix data from cache. In this work we use the recursive algebraic coloring engine (RACE) to enable blocking of sparse matrix data across the polynomial computations. In the graph representing the sparse matrix we form levels using a breadth-first search. Locality relations of these levels are then used to improve spatial and temporal locality when accessing the matrix data and to implement an efficient multithreaded parallelization. Our approach is independent of the matrix structure and avoids shortcomings of existing “blocking” strategies in terms of hardware efficiency and parallelization overhead. We quantify the quality of our implementation using performance modelling and demonstrate speedups of up to 3× and 5× compared to an optimal SpMV-based baseline on a single multicore chip of recent Intel and AMD architectures. Various numerical schemes like s -step Krylov solvers, polynomial preconditioners and power clustering algorithms will benefit from our development.
, , Jiaming Song, Honghu Liu
Published: 21 November 2022
IEEE Transactions on Parallel and Distributed Systems, pp 1-12; https://doi.org/10.1109/tpds.2022.3223796

Abstract:
Fog Computing usefully extends Cloud to the edge of the network for the sake of meeting users' expanding demand for low latency. However, due to its scattered distribution and open architecture, fog nodes are highly vulnerable to security threats, resulting in an inevitable sharp conflict between quick response time and high data security. This conflict motivates the need for effective data placement among fog nodes towards a trade-off between security and time. Existing studies merely offer independent solutions by considering either security or response time. By contrast, we establish a dynamic multi-objective optimization model in this paper by optimizing security and response time simultaneously. With this model, we propose an efficient evolutionary algorithm, referred to as Dynamic Interactive Security-and-Time cognizant algorithm ( DIST ), to obtain optimal data placement strategies under Fog environments. To improve efficiency, DIST allows users to gradually incorporate their preference information into the search process so as to find their most preferred solutions without exploring the whole search space. We demonstrate the superiority of DIST by rigorous comparison with the most state-of-art data placement strategy and other well-applied strategies. Experimental results manifest that DIST outperforms other strategies in obtaining solutions with higher data security and shorter response time. Furthermore, DIST is capable of efficiently and continuously tracking the Pareto optimal solution under dynamically changing Fog environments while other existing strategies cannot.
Shuai Zhang, Zite Jiang, Xingzhong Hou, Mingyu Li, , Haihang You
Published: 18 November 2022
IEEE Transactions on Parallel and Distributed Systems, pp 1-12; https://doi.org/10.1109/tpds.2022.3223068

Abstract:
Nowadays, the ever-increasing volume of graph-structured data such as social networks, graph databases and knowledge graphs requires to be processed efficiently and scalably. These natural graphs commonly found in the real world have highly skewed power-law degree distribution and are called power-law graphs. The subgraph-centric programming model is a promising approach applied in many state-of-the-art distributed graph computing frameworks. However, the performance of subgraph-centric frameworks is limited when processing large-scale power-law graphs. When deployed to the subgraph-centric framework, existing graph partitioning algorithms are not suitable for power-law graphs. In this paper, we present a novel distributed graph computing framework, DRONE (Distributed gRaph cOmputiNg Engine), which leverages the subgraph-centric model and the vertex-cut graph partitioning strategy. DRONE also supports the fault tolerance mechanism to accommodate the increasing scale of machines with negligible overhead (6.48% on average). We further study the execution workflow of DRONE and propose an efficient and balanced graph partition algorithm (EBV) for DRONE. Experiments show that DRONE reduces the running time on real-world graphs by 25.6%, on average, compared to the state-of-the-art distributed graph computing frameworks. In addition, the EBV graph partition algorithm reduces the replication factor by at least 21.8% than other self-based partition algorithms. Our results indicate that DRONE has excellent potential in processing large-scale power-law graphs.
Published: 17 November 2022
IEEE Transactions on Parallel and Distributed Systems, pp 1-13; https://doi.org/10.1109/tpds.2022.3222765

Abstract:
Deploying high-performance vision transformer (ViT) models on ubiquitous Internet of Things (IoT) devices to provide high-quality vision services will revolutionize the way we live, work, and interact with the world. Due to the contradiction between the limited resources of IoT devices and resource-intensive ViT models, the use of cloud servers to assist ViT model training has become mainstream. However, due to the larger number of parameters and floating-point operations (FLOPs) of the existing ViT models, the model parameters transmitted by cloud servers are large and difficult to run on resource-constrained IoT devices. To this end, this paper proposes a transmission-friendly ViT model, TFormer, for deployment on resource-constrained IoT devices with the assistance of a cloud server. The high performance and small number of model parameters and FLOPs of TFormer are attributed to the proposed hybrid layer and the proposed partially connected feed-forward network (PCS-FFN). The hybrid layer consists of nonlearnable modules and a pointwise convolution, which can obtain multitype and multiscale features with only a few parameters and FLOPs to improve the TFormer performance. The PCS-FFN adopts group convolution to reduce the number of parameters. The key idea of this paper is to propose TFormer with few model parameters and FLOPs to facilitate applications running on resource-constrained IoT devices to benefit from the high performance of the ViT models. Experimental results on the ImageNet-1K, MS COCO, and ADE20K datasets for image classification, object detection, and semantic segmentation tasks demonstrate that the proposed model outperforms other state-of-the-art models. Specifically, TFormer-S achieves 5% higher accuracy on ImageNet-1K than ResNet18 with 1.4× fewer parameters and FLOPs.