On the value of learning for Bernoulli bandits with unknown parameters
- 1 January 2000
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Automatic Control
- Vol. 45 (11), 2135-2140
- https://doi.org/10.1109/9.887641
Abstract
In this paper we investigate the multi-armed bandit problem, where each arm generates an infinite sequence of Bernoulli-distributed rewards. The parameters of these Bernoulli distributions are unknown and initially assumed to be Beta-distributed. Every time a bandit is selected, its Beta distribution is updated with new information in a Bayesian way. The objective is to maximize the long-term discounted rewards. We study the relationship between the necessity of acquiring additional information and the reward. This is done by considering two extreme situations which occur when a bandit has been played N times: the situation where the decision maker stops learning and the situation where the decision maker acquires full information about that bandit. We show that the difference in reward between this lower and upper bound goes to zero as N grows large.
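The sketch below is a minimal illustration of the setting the abstract describes, not the paper's construction: a Beta-Bernoulli posterior update for one unknown arm, and the two extremes the abstract contrasts, computed against an assumed known alternative arm with success probability lam under an assumed commit-forever policy and discount factor. All function names, lam, and the Monte Carlo estimate of the full-information value are assumptions introduced here for illustration.

```python
import numpy as np

def update(alpha, beta, reward):
    """Bayesian update of a Beta(alpha, beta) prior after one Bernoulli reward (0 or 1)."""
    return alpha + reward, beta + (1 - reward)

def stop_learning_value(alpha, beta, lam, discount):
    """'Stop learning' lower bound: commit forever to the better of the
    current posterior mean and the known alternative lam."""
    mean = alpha / (alpha + beta)
    return max(mean, lam) / (1.0 - discount)

def full_information_value(alpha, beta, lam, discount, draws=200_000, seed=0):
    """'Full information' upper bound: the true parameter p is revealed first,
    so the decision maker commits to max(p, lam); estimated by Monte Carlo
    over the current Beta posterior."""
    p = np.random.default_rng(seed).beta(alpha, beta, size=draws)
    return np.maximum(p, lam).mean() / (1.0 - discount)

if __name__ == "__main__":
    a, b = 1.0, 1.0                      # uniform prior on the unknown arm
    for reward in [1, 0, 1, 1, 0, 1]:    # N = 6 observed Bernoulli rewards
        a, b = update(a, b, reward)
    lo = stop_learning_value(a, b, lam=0.6, discount=0.9)
    hi = full_information_value(a, b, lam=0.6, discount=0.9)
    print(f"stop-learning value {lo:.3f} <= full-information value {hi:.3f}")
```

As N grows the posterior concentrates around the true parameter, so the gap between the two printed values shrinks, which is the behaviour the abstract's result formalizes.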
This publication has 7 references indexed in Scilit:
- Markov Decision Processes, Wiley Series in Probability and Statistics, 1994
- The Learning Component of Dynamic Allocation Indices, The Annals of Statistics, 1992
- Worth of perfect information in Bernoulli bandits, Advances in Applied Probability, 1991
- A Survey of Some Results in Stochastic Adaptive Control, SIAM Journal on Control and Optimization, 1985
- Bandit problems, published by Springer Science and Business Media LLC, 1985
- On the optimal solution of the one-armed bandit adaptive control problem, IEEE Transactions on Automatic Control, 1981
- Bayesian dynamic programming, Advances in Applied Probability, 1975