stdchk: A Checkpoint Storage System for Desktop Grid Computing
- 1 June 2008
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
- p. 613-624
- https://doi.org/10.1109/icdcs.2008.19
Abstract
Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This article argues that a checkpoint storage system, optimized to operate in these environments, can offer multiple benefits: it can reduce the load on a traditional file system, offer high performance through specialization, and optimize data management by taking checkpoint application semantics into account. Such a storage system can present a unifying abstraction to checkpoint operations, while hiding the fact that there are no dedicated resources to store the checkpoint data. We prototype stdchk, a checkpoint storage system that uses scavenged disk space from participating desktops to build a low-cost storage system, offering a traditional file system interface for easy integration with applications. This article presents the stdchk architecture, key performance optimizations, and its support for incremental checkpointing and increased data availability. Our evaluation confirms that the stdchk approach is viable in a desktop grid setting and offers a low-cost storage system with desirable performance characteristics: high write throughput as well as reduced storage space and network effort to save checkpoint images.