Analysis and optimization of service availability in a HA cluster with load-dependent machine availability
- 13 August 2007
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems
- Vol. 18 (9), 1307-1319
- https://doi.org/10.1109/tpds.2007.1071
Abstract
Calculations of service availability of a high-availability (HA) cluster are usually based on the assumption of load-independent machine availabilities. In this paper, we study the issues and show how the service availabilities can be calculated under the assumption that machine availabilities are load dependent. We present a Markov chain analysis to derive the steady-state service availabilities of a load-dependent machine availability HA cluster. We show that with a load-dependent machine availability, the attained service availability is now policy dependent. After formulating the problem as a Markov decision process, we proceed to determine the optimal policy to achieve the maximum service availabilities by using the method of policy iteration. Two greedy assignment algorithms are studied: least load and first derivative length (FDL) based, where least load corresponds to some load balancing algorithms. We carry out the analysis and simulations on two cases of load profiles: in the first profile, a single machine has the capacity to host all services in the HA cluster; in the second profile, a single machine does not have enough capacity to host all services. We show that the service availabilities achieved under the first load profile are the same, whereas the service availabilities achieved under the second load profile are different. Since the service availabilities achieved are different in the second load profile, we proceed to investigate how the distribution of service availabilities across the services can be controlled by adjusting the rewards vector.
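As a rough illustration of the two ideas named in the abstract (load-dependent machine availability and least-load assignment), the following Python sketch may help. It is not the paper's model: the two-state up/down chain, the linear dependence of the failure rate on load, and all names and rate values (`machine_availability`, `least_load_assign`, `base_failure_rate`, `load_factor`, `repair_rate`) are assumptions made only for this example.

```python
def machine_availability(load, base_failure_rate=0.001,
                         load_factor=0.004, repair_rate=0.1):
    """Steady-state availability of a two-state (up/down) Markov chain.

    Assumption for illustration only: the failure rate grows linearly
    with the load hosted on the machine. For a two-state chain with
    failure rate lam and repair rate mu, the steady-state probability
    of being up is mu / (lam + mu).
    """
    failure_rate = base_failure_rate + load_factor * load
    return repair_rate / (repair_rate + failure_rate)


def least_load_assign(service_loads, n_machines):
    """Greedy least-load assignment.

    Each service is placed, in order, on the machine whose current
    total load is smallest (ties go to the lowest machine index).
    Returns the machine index chosen per service and the final loads.
    """
    loads = [0.0] * n_machines
    placement = []
    for service_load in service_loads:
        target = min(range(n_machines), key=lambda m: loads[m])
        loads[target] += service_load
        placement.append(target)
    return placement, loads


if __name__ == "__main__":
    placement, loads = least_load_assign([0.5, 0.3, 0.2], n_machines=2)
    for machine, load in enumerate(loads):
        print(machine, load, machine_availability(load))
```

Under these assumed rates, a more heavily loaded machine is less available, which is why the choice of assignment policy can change the service availability that is attained.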