IEEE Transactions on Computers
ISSN / EISSN: 00189340 / 15579956
Published by: Institute of Electrical and Electronics Engineers (IEEE)
Total articles ≅ 10,619
Latest articles in this journal
Published: 1 June 2023
IEEE Transactions on Computers, pp 1-14; https://doi.org/10.1109/tc.2023.3281857
Paging and exit overheads have been proven to be the performance bottlenecks when adopting Searchable Symmetric Encryption (SSE) with trusted hardware such as Intel SGX for keyword search. This problem becomes more serious when incorporating ORAM and SGX to design oblivious SSE schemes such as POSUP  and Oblidb  which can defend against inference attacks. The main reason comes from high round communication complexity of ORAM and constrained trusted memory created by SGX. To overcome this performance bottleneck, we propose a set of novel SSE constructions with realistic security/performance trade-offs. Our core idea is to encode the keyword-identifier pairs into a bloom filter to reduce the number of ORAM operations during the search procedure. Specifically, Construction 1 loads the bloom filter into the enclave sequentially, which outperforms about
when the dataset is large compared with the performance of the baseline that directly combines ORAM and SGX. To further improve the performance of Construction 1, Construction 2 classifies keywords into groups and stores these groups in different bloom filters. By additionally leaking the keywords in search token belonging to which groups, Construction 2 outperforms Construction 1 by $1.7\times$ and provides an improvement of at least one order over state-of-the-art oblivious protocols. $16.5\sim 36.8\times$
Published: 25 May 2023
IEEE Transactions on Computers, pp 1-13; https://doi.org/10.1109/tc.2023.3280134
In the fourth industrial revolution, securing the protection of supply chains has become an ever-growing concern. One such cyber threat is a hardware Trojan (HT), a malicious modification to an IC. HTs are often identified during the hardware manufacturing process but should be removed earlier in the design process. Machine learning-based HT detection in gate-level netlists is an efficient approach to identifying HTs at the early stage. However, feature-based modeling has limitations in terms of discovering an appropriate set of HT features. We thus propose NHTD-GL in this paper, a novel node-wise HT detection method based on graph learning (GL). Given the formal analysis of the HT features obtained from domain knowledge, NHTD-GL bridges the gap between graph representation learning and feature-based HT detection. The experimental results demonstrate that NHTD-GL achieves 0.998 detection accuracy and 0.921 F1-score and outperforms state-of-the-art node-wise HT detection methods. NHTD-GL extracts HT features without heuristic feature engineering.
Published: 25 May 2023
IEEE Transactions on Computers, pp 1-12; https://doi.org/10.1109/tc.2023.3280136
Digit-recurrence algorithms are widely used in actual microprocessors to compute floating-point division and square root. These iterative algorithms present a good trade-off in terms of performance, area and power. Traditionally, commercial processors have iterative division and square root units where the iteration logic is used over several cycles. The main drawbacks of these iterative units are long latency and low throughput due to the reuse of part of the logic over several cycles, and its hardware complexity with separated logic for division and square root. We present a radix-64 floating-point division and square root algorithm with a common iteration for division and square root and where, to have an affordable implementation, each radix-64 iteration is made of two simpler radix-8 iterations. The radix-64 algorithm allows to get low-latency operations, and the common division and square root radix-64 iteration results in some area reduction. The algorithm is mapped into two different microarchitectures: a low-latency and low area iterative unit, and a low-latency and high-throughput pipelined unit. In both units speculation between consecutive radix-8 iterations is used to reduce the timing.
Published: 25 May 2023
IEEE Transactions on Computers, pp 1-15; https://doi.org/10.1109/tc.2023.3280137
Increasing performance needs of modern cyber-physical systems leads to multiprocessor architectures being increasingly utilized. To efficiently exploit their potential parallelism in hard real-time systems, appropriate task models and scheduling algorithms that allow to provide timing guarantees are required. Such scheduling algorithms and the corresponding worst-case response time analyses usually suffer from resource over-provisioning due to pessimistic analyses based on worst-case assumptions. Hence, scheduling algorithms and analyses with high resource efficiency are required. A prominent fine-grained parallel task model is the directed-acyclic-graph (DAG) task model that is composed of precedence constrained subjobs. This paper studies the hierarchical real-time scheduling problem of sporadic arbitrary-deadline DAG tasks. We propose a parallel path progression scheduling property that is implemented with only two distinct subtask priorities, which allows to quantify the parallel execution of a user chosen collection of complete paths in the response time analysis. This novel approach significantly improves the state-of-the-art response time analyses for parallel DAG tasks for highly parallel DAG structures and can provably exhaust large core numbers. Two hierarchical scheduling algorithms are designed based on this property, extending the parallel path progression properties and improve the response time analysis for sporadic arbitrary-deadline DAG task sets.
Published: 22 May 2023
IEEE Transactions on Computers, pp 1-14; https://doi.org/10.1109/tc.2023.3278541
By leveraging standard IT virtualization technology and Commercial-Off-The-Shelf (COTS) servers, Network Function Virtualization (NFV) decouples network functions from proprietary hardware devices for flexible service provisioning. But the potential of NFV is significantly limited by its performance inefficiency. With the unparalleled advantages of multi-core parallelism and high memory bandwidth, Graphics Processing Units (GPUs) are regarded as a promising way to accelerate Virtualized Network Functions (VNF). However, the special architecture of GPU brings new challenges to task scheduling and resource allocation. To this end, we propose a G PU o riented s patio- t emporal sharing framework for NFV called Gost , aiming for GPU based VNF performance promotion. The execution order and GPU resource allocation (i.e., the number of threads) are considered in task scheduling to minimize the end-to-end latency for VNF flows. First, we formulate the task scheduling problem into a nonlinear programming form, and then transform it into an equivalent Integer Linear Programming (ILP) form.The problem is proved as NP-hard. We customize the classical list scheduling algorithm and propose a List Scheduling based Spatio-Temporal GPU sharing strategy (LSSTG), whose achievable worst-case performance is also formally analyzed. We practically implement Gost prototype, based on which extensive experiments verify the high performance efficiency of LSSTG compared to state-of-the-art in terms of latency and throughput.
Published: 22 May 2023
IEEE Transactions on Computers, pp 1-12; https://doi.org/10.1109/tc.2023.3278535
A thermal simulation methodology derived from the proper orthogonal decomposition (POD) and the Galerkin projection (GP), hereafter referred to as PODTherm-GP, is evaluated in terms of its efficiency and accuracy in a multi-core CPU. The GP projects the heat transfer equation onto a mathematical space whose basis functions are generated from thermal data enabled by the POD learning algorithm. The thermal solution data are collected from FEniCS using the finite element method (FEM) accounting for appropriate parametric variations. The GP incorporates physical principles of heat transfer in the methodology to reach high accuracy and efficiency. The dynamic power map for the CPU in FEM thermal simulation is generated from gem5 and McPACT, together with the SPLASH-2 benchmarks as the simulation workload. It is shown that PODTherm-GP offers an accurate thermal prediction of the CPU with a resolution as fine as the FEM. It is also demonstrated that PODTherm-GP is capable of predicting the dynamic thermal profile of the chip with a good accuracy beyond the training conditions. Additionally, the approach offers a reduction in degrees of freedom by more than 5 orders of magnitude and a speedup of 4 orders, compared to the FEM.
Published: 22 May 2023
IEEE Transactions on Computers, pp 1-14; https://doi.org/10.1109/tc.2023.3278536
This work describes the hardware implementation of a cryptographic accelerators suite, named Crypto-Tile, in the framework of the European Processor Initiative (EPI) project. The EPI project traced the roadmap to develop the first family of low-power processors with the design fully made in Europe, for Big Data, supercomputers and automotive. Each of the coprocessors of Crypto-Tile is dedicated to a specific family of cryptographic algorithms, offering functions for symmetric and public-key cryptography, computation of digests, generation of random numbers, and Post-Quantum cryptography. The performances of each coprocessor outperform other available solutions, offering innovative hardware-native services, such as key management, clock randomisation and access privilege mechanisms. The system has been synthesised on a 7 nm standard-cell technology, being the first Cryptoprocessor to be characterised in such an advanced silicon technology. The post-synthesis netlist has been employed to assess the resistance of Crypto-Tile to power analysis side-channel attacks. Finally, a demoboard has been implemented, integrating a RISC-V softcore processor and the Crypto-Tile module, and drivers for hardware abstraction layer, bare-metal applications and drivers for Linux kernel in C language have been developed. Finally, we exploited them to compare in terms of execution speed the hardware-accelerated algorithms against software-only solutions.
Published: 22 May 2023
IEEE Transactions on Computers, pp 1-14; https://doi.org/10.1109/tc.2023.3278530
Training Deep Neural Networks (DNN) concurrently is becoming increasingly important for deep learning practitioners, e.g., hyperparameter optimization (HPO) and neural architecture search (NAS) . The GPU memory capacity is the impediment that prohibits multiple DNNs from being trained on the same GPU due to the large memory usage during training. In this paper, we propose Waterwave a GPU memory flow engine for concurrent deep learning training. Firstly, to address the memory explosion brought by the long time lag between memory allocation and deallocation time, we develop an allocator tailored for multi-streams. By making the allocator aware of the stream information, a prioritized allocation is conducted based on the chunk's synchronization attributes, allowing us to provide useable memory after scheduling rather than waiting it to be really released after GPU computation. Secondly, Waterwave partitions the compute graph to a set of continuous node groups and then performs finer-grained scheduling: NodeGroup pipeline execution , to guarantee a proper memory requests order. Waterwave can accomplish up to 96.8% of the maximum batch size of solo training. Additionally, in scenarios with high memory demand, Waterwave can outperform existing spatial sharing and temporal sharing by up to 12x and 1.49x, respectively.
Published: 15 May 2023
IEEE Transactions on Computers, pp 1-13; https://doi.org/10.1109/tc.2023.3275096
Covert channels serve the construction of cyberspace security. By realizing the secure transmission of data, it is widely used in political and financial fields. Blockchain covert channels have higher reliability and concealment compared to traditional network-based covert channels. However, existing blockchain covert storage channels need to create a large number of transactions to transmit covert information. Creating transactions requires a transation fee, which means that the implementation of blockchain covert storage channels requires a high cost. Besides, created transactions remain on-chain permanently, leading to the threat of covert information being detected. To overcome these limitations, we propose a blockchain covert timing channel framework. Specifically, we utilize inv and getdata messages in the Bitcoin transaction broadcast as carriers and propose three modulation modes to achieve covert channels without cost and leaving no trace. We evaluate the concealment of our modes by K-S, KLD tests, and machine learning approaches. Experimental results show the indistinguishability between traffic carrying covert information and normal traffic. Our channels promise a capacity of 2.4 bit/s.
Published: 11 May 2023
IEEE Transactions on Computers, pp 1-13; https://doi.org/10.1109/tc.2023.3275107
Distributed asynchronous stochastic gradient descent (ASGD) algorithms that approximate low-rank matrix factorizations for collaborative filtering perform one or more synchronizations per epoch where staleness is reduced with more synchronizations. However, high number of synchronizations would prohibit the scalability of the algorithm. We propose a parallel ASGD algorithm,
-PASGD, for efficiently handling $\eta$ synchronizations per epoch in a scalable fashion. The proposed algorithm puts an upper limit of $\eta$ on $K$ , for a $\eta$ -processor system, such that performing $K$ synchronizations per epoch would eliminate the staleness completely. The rating data used in collaborative filtering are usually represented as sparse matrices. The sparsity allows for reduction in the staleness and communication overhead combinatorially via intelligently distributing the data to processors. We analyze the staleness and the total volume incurred during an epoch of $\eta =K$ -PASGD. Following this analysis, we propose a hypergraph partitioning model to encapsulate reducing staleness and volume while minimizing the maximum number of synchronizations required for a stale-free SGD. This encapsulation is achieved with a novel cutsize metric that is realized via a new recursive-bipartitioning-based algorithm. Experiments on up to 512 processors show the importance of the proposed partitioning method in improving staleness, volume, RMSE and parallel runtime. $\eta$