Abstract
In this paper we investigate the multi-armed bandit problem, where each arm generates an infinite sequence of Bernoulli distributed rewards. The parameters of these Bernoulli distributions are unknown and initially assumed to be Beta-distributed. Every time a bandit is selected, its Beta distribution is updated with the new information in a Bayesian way. The objective is to maximize the long-term discounted reward. We study the relationship between the necessity of acquiring additional information and the reward. This is done by considering two extreme situations that occur after a bandit has been played N times: the situation where the decision maker stops learning and the situation where the decision maker acquires full information about that bandit. We show that the difference in reward between this lower and upper bound goes to zero as N grows large.
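For concreteness, the Bayesian update and objective described above can be written out as follows (a minimal sketch; the symbols $\alpha$, $\beta$, $\gamma$, $\theta$, and $R_t$ are our own notation, not necessarily the paper's). If an arm's unknown success probability $\theta$ has prior $\mathrm{Beta}(\alpha, \beta)$ and playing the arm yields a reward $r \in \{0, 1\}$, the conjugate posterior is

$$\theta \mid r \;\sim\; \mathrm{Beta}(\alpha + r,\; \beta + 1 - r),$$

so a success increments $\alpha$ and a failure increments $\beta$. The decision maker then seeks to maximize the expected discounted reward

$$\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_t\right], \qquad 0 < \gamma < 1.$$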
