Analysis and optimization of service availability in a HA cluster with load-dependent machine availability

13 August 2007

journal article
Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems

Vol. 18 (9), 1307-1319
https://doi.org/10.1109/tpds.2007.1071

Abstract

Calculations of service availability of a high-availability (HA) cluster are usually based on the assumption of load-independent machine availabilities. In this paper, we study the issues and show how the service availabilities can be calculated under the assumption that machine availabilities are load dependent. We present a Markov chain analysis to derive the steady-state service availabilities of a load-dependent machine availability HA cluster. We show that with a load-dependent machine availability, the attained service availability is now policy dependent. After formulating the problem as a Markov decision process, we proceed to determine the optimal policy to achieve the maximum service availabilities by using the method of policy iteration. Two greedy assignment algorithms are studied: least load and first derivative length (FDL) based, where least load corresponds to some load balancing algorithms. We carry out the analysis and simulations on two cases of load profiles: in the first profile, a single machine has the capacity to host all services in the HA cluster; in the second profile, a single machine does not have enough capacity to host all services. We show that the service availabilities achieved under the first load profile are the same, whereas the service availabilities achieved under the second load profile are different. Since the service availabilities achieved are different in the second load profile, we proceed to investigate how the distribution of service availabilities across the services can be controlled by adjusting the rewards vector.
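The claim that service availability becomes policy dependent once machine failure rates depend on load can be illustrated with a small sketch. The model below is not taken from the paper: the two-machine, two-service cluster, the convex failure-rate function `fail_rate`, the numeric rates, and the two assignment policies `pack` and `spread` (a rough stand-in for a least-load rule) are all assumptions chosen for illustration. It builds a continuous-time Markov chain over machine up/down states, finds its steady-state distribution by iterating the uniformized chain, and treats the services as available whenever at least one machine is up (as in the first load profile, where one machine can host everything).

```python
from itertools import product

N_MACHINES = 2   # assumed cluster size
N_SERVICES = 2   # assumed number of services

def fail_rate(load, base=0.01, slope=0.5):
    """Hypothetical failure rate that grows convexly with hosted services."""
    return base * (1.0 + slope * load ** 2)

def pack(up):
    """Assumed policy: put every service on the lowest-indexed up machine."""
    loads = [0] * len(up)
    if any(up):
        loads[up.index(1)] = N_SERVICES
    return loads

def spread(up):
    """Assumed least-load-style policy: deal services across up machines."""
    loads = [0] * len(up)
    alive = [i for i, u in enumerate(up) if u]
    for s in range(N_SERVICES):
        if alive:
            loads[alive[s % len(alive)]] += 1
    return loads

def steady_state(policy, repair=1.0, slope=0.5, iters=2000):
    """Steady-state distribution over up/down states under a given policy."""
    states = list(product((0, 1), repeat=N_MACHINES))
    index = {s: k for k, s in enumerate(states)}
    rates = []  # rates[k] = list of (target state index, transition rate)
    for s in states:
        loads = policy(s)
        out = []
        for i, u in enumerate(s):
            flipped = s[:i] + (1 - u,) + s[i + 1:]
            rate = fail_rate(loads[i], slope=slope) if u else repair
            out.append((index[flipped], rate))
        rates.append(out)
    # Uniformization: scale by a rate above every exit rate, then iterate.
    big = 1.1 * max(sum(r for _, r in out) for out in rates)
    pi = [1.0 / len(states)] * len(states)
    for _ in range(iters):
        nxt = [0.0] * len(states)
        for k, out in enumerate(rates):
            stay = 1.0
            for j, r in out:
                nxt[j] += pi[k] * r / big
                stay -= r / big
            nxt[k] += pi[k] * stay
        pi = nxt
    return dict(zip(states, pi))

def service_availability(policy, **kwargs):
    """Services are up unless every machine is down (single-host capacity)."""
    pi = steady_state(policy, **kwargs)
    return 1.0 - pi[(0,) * N_MACHINES]

if __name__ == "__main__":
    for name, policy in (("pack", pack), ("spread", spread)):
        print(name, service_availability(policy))
```

With `slope=0.0` the failure rate no longer depends on load and both policies give the same availability; with a positive slope the two values separate, which is the policy dependence the abstract describes. Finding the best policy in general is what the paper addresses with a Markov decision process and policy iteration, which this sketch does not attempt.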
