Highly available, fault-tolerant, parallel dataflows
- 13 June 2004
- conference paper
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 827-838
- https://doi.org/10.1145/1007568.1007662
Abstract
We present a technique that masks failures in a cluster to provide high availability and fault-tolerance for long-running, parallelized dataflows. We can use these dataflows to implement a variety of continuous query (CQ) applications that require high-throughput, 24x7 operation. Examples include network monitoring, phone call processing, click-stream processing, and online financial analysis. Our main contribution is a scheme that carefully integrates traditional query processing techniques for partitioned parallelism with the process-pairs approach for high availability. This delicate integration allows us to tolerate failures of portions of a parallel dataflow without sacrificing result quality. Upon failure, our technique provides quick fail-over, and automatically recovers the lost pieces on the fly. This piecemeal recovery provides minimal disruption to the ongoing dataflow computation and improved reliability as compared to the straight-forward application of the process-pairs technique on a per dataflow basis. Thus, our technique provides the high availability necessary for critical CQ applications. Our techniques are encapsulated in a reusable dataflow operator called Flux, an extension of the Exchange that is used to compose parallel dataflows. Encapsulating the fault-tolerance logic into Flux minimizes modifications to existing operator code and relieves the burden on the operator writer of repeatedly implementing and verifying this critical logic. We present experiments illustrating these features with an implementation of Flux in the TelegraphCQ code base [8].Keywords
This publication has 13 references indexed in Scilit:
- Fjording the stream: an architecture for queries over streaming sensor dataPublished by Institute of Electrical and Electronics Engineers (IEEE) ,2003
- Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web servicesACM SIGACT News, 2002
- Phoenix projectACM SIGMOD Record, 2002
- The ABCD's of PaxosPublished by Association for Computing Machinery (ACM) ,2001
- Bro: a system for detecting network intruders in real-timeComputer Networks, 1999
- A case for NOW (Networks of Workstations)IEEE Micro, 1995
- Query evaluation techniques for large databasesACM Computing Surveys, 1993
- Parallel database systemsCommunications of the ACM, 1992
- Implementing fault-tolerant services using the state machine approach: a tutorialACM Computing Surveys, 1990
- Byzantine generals in actionACM Transactions on Computer Systems, 1984