Abstract
A recently developed perturbation formalism for finite Markov chains is used here to analyze the policy iteration algorithm for undiscounted, single-chain Markov renewal programming. The relative values are shown to be essentially partial derivatives of the gain rate with respect to the transition probabilities, and they rank the states by indicating desirable changes in the probabilistic structure. This both implies the optimality of nonrandomized policies and suggests a gradient technique for optimizing the gain rate with respect to a parameter. The policy iteration algorithm is shown to be a steepest-ascent technique in policy space: the successor to a given policy is chosen in a direction that maximizes the directional derivative of the gain rate. The appearance of the original policy's gain and relative values in the policy-improvement step is explained by the fact that they essentially determine the gradient of the gain rate.
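
The following is a minimal numerical sketch (not part of the paper) of the gradient relation the abstract alludes to, restricted to the discrete-time unichain special case with hypothetical transition matrices P_A, P_B and reward vectors r_A, r_B. For a policy with stationary distribution pi, gain g = pi r, and relative values v solving v + g*1 = r + P v, the directional derivative of the gain along a perturbation (dP, dr) is pi (dr + dP v); this is the quantity the policy-improvement step maximizes when it evaluates candidate actions with the original policy's gain and relative values.

    import numpy as np

    def stationary(P):
        # Stationary distribution pi of an irreducible stochastic matrix P (pi P = pi, sum(pi) = 1).
        n = P.shape[0]
        A = np.vstack([P.T - np.eye(n), np.ones(n)])
        b = np.zeros(n + 1)
        b[-1] = 1.0
        return np.linalg.lstsq(A, b, rcond=None)[0]

    def gain_and_bias(P, r):
        # Gain g = pi r and relative values v solving (I - P) v = r - g*1, pinned by v[0] = 0.
        n = P.shape[0]
        pi = stationary(P)
        g = pi @ r
        A = np.vstack([np.eye(n) - P, np.eye(n)[:1]])   # appended row enforces v[0] = 0
        b = np.concatenate([r - g * np.ones(n), [0.0]])
        v = np.linalg.lstsq(A, b, rcond=None)[0]
        return g, v, pi

    # Two hypothetical policies for the same three-state problem.
    P_A = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]])
    r_A = np.array([1.0, 2.0, 0.5])
    P_B = np.array([[0.2, 0.5, 0.3], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]])
    r_B = np.array([0.8, 2.5, 1.0])

    g_A, v_A, pi_A = gain_and_bias(P_A, r_A)

    # Directional derivative of the gain when moving from policy A toward policy B,
    # computed from A's own stationary distribution and relative values.
    predicted = pi_A @ ((r_B - r_A) + (P_B - P_A) @ v_A)

    # Finite-difference check along the randomized mixture (1 - eps)*A + eps*B.
    eps = 1e-5
    g_eps, _, _ = gain_and_bias((1 - eps) * P_A + eps * P_B,
                                (1 - eps) * r_A + eps * r_B)
    print(predicted, (g_eps - g_A) / eps)   # the two numbers should agree closely

In the full Markov renewal setting the gain rate also involves the expected holding times, but the same interpretation holds: the relative values weight candidate changes in the transition structure, so improving the evaluated quantity state by state moves the policy in a direction of steepest ascent of the gain rate.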