The Multi-Armed Bandit Problem: Decomposition and Computation

Abstract
This paper is dedicated to our friend and mentor, Cyrus Derman, on the occasion of his 60th birthday.

The multi-armed bandit problem arises in sequentially allocating effort to one of N projects and in sequentially assigning patients to one of N treatments in clinical trials. Gittins and Jones (Gittins, J. C., Jones, D. M. 1974. A dynamic allocation index for the sequential design of experiments. J. Gani, K. Sarkadi, L. Vince, eds. Progress in Statistics. European Meeting of Statisticians, 1972, 1, North Holland, Amsterdam, 241–266.) have shown that one optimal policy for the N-project problem, an N-dimensional discounted Markov decision chain, is determined by the following largest-index rule: there is an index for each state of each given project that depends only on the data of that project, and in each period one allocates effort to a project with the largest current index. The purpose of this paper is to give a short proof of this result and a new characterization of the index of a project in state i, viz., as the maximum expected present value in state i for the restart-in-i problem in which, in each state and period, one either continues allocating effort to the project or immediately restarts the project in state i. Moreover, it is shown that an approximate largest-index rule yields an approximately optimal policy. These results lead to more efficient methods than have been suggested heretofore for computing the indices on-line and/or for sparse transition matrices in large state spaces. By using a suitable implementation of successive approximations, a policy whose expected present value is within 100ϵ% of the maximum possible range of values of the indices can be found on-line with at most (N + T − 1)TM operations, where M is the number of operations required to calculate one approximation, T is the least integer majorizing the ratio ln ϵ/ln a, and 0 < a < 1 is the discount factor.
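
To make the restart-in-i characterization concrete (the notation below is ours, not taken from the abstract): for a single project with one-period reward r(j) in state j, transition probabilities p_{jk}, and discount factor a, the optimal value function v_i of the restart-in-i problem satisfies

\[
v_i(j) \;=\; \max\Big\{\, r(j) + a \sum_{k} p_{jk}\, v_i(k),\;\; r(i) + a \sum_{k} p_{ik}\, v_i(k) \Big\} \qquad \text{for all states } j,
\]

and the index of the project in state i is v_i(i), the maximum expected present value starting in state i.

The following Python sketch (hypothetical names; the stopping tolerance and other implementation details are our assumptions, not the paper's prescribed procedure) illustrates how a single index could be approximated by successive approximations on the restart-in-i problem:

    import numpy as np

    def restart_index(r, P, a, i, tol=1e-8):
        # Approximate the index of state i as the optimal value at i of the
        # restart-in-i problem, computed by successive approximations.
        # r: reward vector (n,); P: transition matrix (n, n); 0 < a < 1 discount.
        v = np.zeros(len(r))
        while True:
            cont = r + a * (P @ v)            # continue allocating effort to the project
            restart = r[i] + a * (P[i] @ v)   # immediately restart the project in state i
            v_new = np.maximum(cont, restart)
            if np.max(np.abs(v_new - v)) < tol:
                return v_new[i]               # index of state i
            v = v_new

Under the largest-index rule one would then, in each period, allocate effort to a project whose current state has the largest such index.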