Implementation of online distributed system-level diagnosis theory

Abstract
There has been significant theoretical research in the area of system-level diagnosis. This paper documents the first practical application and implementation of on-line distributed system-level diagnosis theory. Proven distributed diagnosis algorithms are shown to be impractical in real systems because of their high resource requirements. A new distributed system-level diagnosis algorithm, called Adaptive DSD, is shown to minimize network resources and has resulted in a practical implementation. Adaptive DSD assumes a distributed network in which nodes can test other nodes and determine them to be faulty or fault-free. Each node issues its tests adaptively, depending on the fault situation of the network. Test result reports are generated from test results and forwarded between nodes in the network. Adaptive DSD is proven correct in that each fault-free node reaches an accurate, independent diagnosis of the fault conditions of the remaining nodes. No restriction is placed on the number of faulty nodes: any fault situation with any number of faulty nodes is diagnosed correctly.

The Adaptive DSD algorithm is implemented and currently monitors over 200 workstations in the Electrical and Computer Engineering Department at Carnegie Mellon University. The algorithm has executed continuously for the past year, even though no single workstation has remained fault-free over that period. Key results of this paper include an overview of previous distributed system-level diagnosis algorithms, the specification of a new adaptive distributed system-level diagnosis algorithm, its comparison to previous centralized adaptive and distributed non-adaptive schemes, its application to an actual distributed network environment, and experimentation within that environment.
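The abstract does not spell out the testing rule, so the following is only a minimal sketch of how adaptive testing and report forwarding might fit together. It assumes, as an illustration, that nodes are numbered 0..n-1 in a logical ring, that each node tests its successors in order until it finds a fault-free one, and that tests are perfectly reliable. The names `adaptive_test_round` and `diagnose` are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def adaptive_test_round(
    node: int, n: int, is_faulty: Callable[[int], bool]
) -> Tuple[Dict[int, bool], Optional[int]]:
    """Test ring successors of `node` until a fault-free node is found.

    Assumption: ring ordering and stop-at-first-fault-free are illustrative
    choices, not details taken from the abstract.
    Returns the test results (target -> faulty?) and the first fault-free
    node found, or None if every other node tested faulty.
    """
    results: Dict[int, bool] = {}
    for offset in range(1, n):
        target = (node + offset) % n
        faulty = is_faulty(target)
        results[target] = faulty
        if not faulty:
            return results, target
    return results, None


def diagnose(observer: int, n: int, faulty_nodes: Set[int]) -> Dict[int, bool]:
    """Build the diagnosis a fault-free `observer` would reach.

    Reports are modelled as being forwarded back along the chain of
    fault-free nodes; here that chain is simply walked in a loop.
    """
    diagnosis: Dict[int, bool] = {observer: False}
    current = observer
    while True:
        results, next_ok = adaptive_test_round(
            current, n, lambda t: t in faulty_nodes
        )
        diagnosis.update(results)  # report from `current` reaches the observer
        if next_ok is None or next_ok == observer:
            break
        current = next_ok
    return diagnosis


if __name__ == "__main__":
    view = diagnose(observer=0, n=8, faulty_nodes={1, 2, 5})
    print(sorted(node for node, faulty in view.items() if faulty))
```

Under these assumptions, the number of tests a node issues grows with the number of consecutive faulty successors it meets, which is one way to read "tests are issued adaptively". The walk also ends correctly when the observer is the only fault-free node, in line with the claim that no bound is placed on the number of faulty nodes.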
