COMeT: Continuous Online Memory Test
- 1 December 2011
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 109-118
- https://doi.org/10.1109/prdc.2011.22
Abstract
Today's computers have gigabytes of main memory due to improved DRAM density. As density increases, smaller bit cells become more susceptible to errors. With an increase in error susceptibility, the need for memory resiliency also increases. Self-testing of memory health can proactively check for errors to improve resiliency. This paper describes a software-only self test to continuously test memory. We present the challenges and design for an approach, called Continuous Online Memory Testing (COMeT), that targets chip multiprocessors. COMeT tests memory health simultaneously with application execution in anticipation of allocation requests. The approach guarantees that memory is tested within a fixed time interval to limit exposure to lurking errors. We developed and evaluated an implementation of COMeT. On the SPEC CPU2006 benchmarks, COMeT has a low 4% average performance overhead. When emulated errors were injected into physical memory, applications executed 1.13× to 4.41× longer with COMeT than without it.