A 'cool' way of improving the reliability of HPC machines
- 17 November 2013
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
No abstract available
Funding Information
- National Science Foundation (NSF ITR-HECURA-0833188, NSF CNS 09-58314)
- U.S. Department of Energy (DOE DE-SC0001845)
- Division of Computer and Network Systems (NSF ITR-HECURA-0833188, NSF CNS 09-58314)
This publication has 14 references indexed in Scilit:
- Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems. Published by Institute of Electrical and Electronics Engineers (IEEE), 2012
- Evaluating the viability of process replication reliability for exascale systems. Published by Association for Computing Machinery (ACM), 2011
- A 'cool' load balancer for parallel applications. Published by Association for Computing Machinery (ACM), 2011
- Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. Published by Institute of Electrical and Electronics Engineers (IEEE), 2011
- Periodic hierarchical load balancing for large supercomputers. The International Journal of High Performance Computing Applications, 2011
- A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 2006
- Lifetime Reliability: Toward an Architectural Solution. IEEE Micro, 2005
- Towards Efficient Supercomputing: A Quest for the Right Metric. Published by Institute of Electrical and Electronics Engineers (IEEE), 2005
- Making a Case for Efficient Supercomputing. Queue, 2003
- A first order approximation to the optimum checkpoint interval. Communications of the ACM, 1974