A 'cool' way of improving the reliability of HPC machines

17 November 2013

conference paper
Published by Association for Computing Machinery (ACM)

Abstract

No abstract available

Funding Information

National Science Foundation (NSF ITR-HECURA-0833188, NSF CNS 09-58314)
U.S. Department of Energy (DOE DE-SC0001845)
Division of Computer and Network Systems (NSF ITR-HECURA-0833188, NSF CNS 09-58314)

This publication has 14 references indexed in Scilit:

Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems
Published by Institute of Electrical and Electronics Engineers (IEEE), 2012
Evaluating the viability of process replication reliability for exascale systems
Published by Association for Computing Machinery (ACM), 2011
A 'cool' load balancer for parallel applications
Published by Association for Computing Machinery (ACM), 2011
Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing
Published by Institute of Electrical and Electronics Engineers (IEEE), 2011
Periodic hierarchical load balancing for large supercomputers
The International Journal of High Performance Computing Applications, 2011
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems, 2006
Lifetime Reliability: Toward an Architectural Solution
IEEE Micro, 2005
Towards Efficient Supercomputing: A Quest for the Right Metric
Published by Institute of Electrical and Electronics Engineers (IEEE), 2005
Making a Case for Efficient Supercomputing
Queue, 2003
A first order approximation to the optimum checkpoint interval
Communications of the ACM, 1974

Cited by 21 articles