Reliable communication in the presence of failures
- 5 January 1987
- journal article
- Published by Association for Computing Machinery (ACM) in ACM Transactions on Computer Systems
- Vol. 5 (1), 47-76
- https://doi.org/10.1145/7351.7478
Abstract
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.