Distributed program reliability analysis

Abstract
The reliability of distributed processing systems can be expressed in terms of the reliability of the processing elements that run the programs, the reliability of the processing elements holding the required files, and the reliability of the communication links used in file transfers. The authors introduce two reliability measures, namely distributed program reliability and distributed system reliability, to accurately model the reliability of distributed systems. The first measure describes the probability of successful execution of a distributed program which runs on some processing elements and needs to communicate with other processing elements for remote files, while the second measure describes the probability that all the programs of a given set can run successfully. The notion of minimal file spanning trees is introduced to efficiently evaluate these reliability measures. Graph theory techniques are used to systematically generate file spanning trees that provide all the required connections. The technique is general and can be used in a dynamic environment for efficient reliability evaluation.