Dependability measurement and modeling of a multicomputer system
- 1 January 1993
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Computers
- Vol. 42 (1), 62-75
- https://doi.org/10.1109/12.192214
Abstract
A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented. Basic system dependability characteristics such as error/failure distributions and hazard rate are obtained for both the individual machine and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck. It is shown that, for the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for short-term operation and underestimates it for long-term operation. Correlation analysis shows that errors are highly correlated across machines and across time. The failure correlation coefficient is low; however, its effect on system unavailability is significant.
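As a rough illustration of what a "transient reward rate" means under a homogeneous (constant-rate) Markov model, the sketch below uses a two-state up/down chain with a closed-form transient solution. It is not the model from the paper: the two-state structure, the function names, and the numeric rates are all assumptions chosen for illustration only.

```python
import math


def transient_reward_rate(t, fail_rate, repair_rate,
                          reward_up=1.0, reward_down=0.0):
    """Expected reward rate at time t for a two-state Markov chain.

    Assumes the system starts in the 'up' state, fails at constant
    rate fail_rate, and is repaired at constant rate repair_rate.
    """
    total = fail_rate + repair_rate
    # Closed-form probability of being 'up' at time t, starting 'up'.
    p_up = repair_rate / total + (fail_rate / total) * math.exp(-total * t)
    return reward_up * p_up + reward_down * (1.0 - p_up)


def steady_state_reward_rate(fail_rate, repair_rate,
                             reward_up=1.0, reward_down=0.0):
    """Long-run expected reward rate of the same two-state chain."""
    p_up = repair_rate / (fail_rate + repair_rate)
    return reward_up * p_up + reward_down * (1.0 - p_up)


if __name__ == "__main__":
    # Hypothetical rates (per hour), not measured values from the paper.
    fail_rate, repair_rate = 0.01, 1.0
    for t in (0.0, 0.5, 2.0, 10.0):
        rate = transient_reward_rate(t, fail_rate, repair_rate)
        print(f"t={t:5.1f} h  expected reward rate={rate:.6f}")
    print("steady state:", steady_state_reward_rate(fail_rate, repair_rate))
```

With constant rates the transient reward rate decays smoothly from the initial reward toward the steady-state value. The abstract's finding is that measured behavior departs from this curve in both the short and the long term.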