PRISM
Open Access
- 8 June 2021
- journal article
- research article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Architecture and Code Optimization
- Vol. 18 (3), 1-25
- https://doi.org/10.1145/3450523
Abstract
Multicores increasingly deploy safety-critical parallel applications that demand resiliency against soft errors to satisfy safety standards. However, protection against these errors is challenging due to complex communication and data access protocols that aggressively share on-chip hardware resources. Research has explored various temporal and spatial redundancy-based resiliency schemes that provide multicores with high soft-error coverage. However, redundant execution incurs performance overheads due to interference effects induced by aggressive resource sharing. Moreover, these schemes require intrusive hardware modifications and fall short in providing efficient system availability guarantees. This article proposes PRISM, a resilient multicore architecture that incorporates strong hardware isolation to form redundant clusters of cores, ensuring a non-interference-based redundant execution environment. A soft error in one cluster does not affect the execution of the other cluster, resulting in high system availability. Implementing strong isolation for shared hardware resources, such as queues, caches, and networks, requires logic for partitioning. However, it is less intrusive, as complex hardware modifications to protocols, such as hardware cache coherence, are avoided. The PRISM approach is prototyped on a real Tilera Tile-Gx72 processor that enables primitives to implement the proposed cluster-level hardware resource isolation. The evaluation shows performance benefits from avoiding destructive hardware interference effects with redundant execution, while delivering superior system availability.
Funding Information
- Semiconductor Research Corporation
- National Science Foundation (CNS-1929261)