On the value of learning for Bernoulli bandits with unknown parameters
- 1 January 2000
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Automatic Control
- Vol. 45 (11), 2135-2140
- https://doi.org/10.1109/9.887641
Abstract
In this paper we investigate the multi-armed bandit problem, where each arm generates an infinite sequence of Bernoulli-distributed rewards. The parameters of these Bernoulli distributions are unknown and initially assumed to be Beta-distributed. Every time a bandit is selected, its Beta distribution is updated with new information in a Bayesian way. The objective is to maximize the long-term discounted rewards. We study the relationship between the necessity of acquiring additional information and the reward. This is done by considering two extreme situations which occur when a bandit has been played N times: the situation where the decision maker stops learning and the situation where the decision maker acquires full information about that bandit. We show that the difference in reward between this lower and upper bound goes to zero as N grows large.
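The sketch below is a minimal illustration of the setting the abstract describes, not the paper's construction: a Beta-Bernoulli posterior update for one unknown arm, and the two extremes the abstract contrasts, computed against an assumed known alternative arm with success probability lam under an assumed commit-forever policy and discount factor. All function names, lam, and the Monte Carlo estimate of the full-information value are assumptions introduced here for illustration.

```python
import numpy as np

def update(alpha, beta, reward):
    """Bayesian update of a Beta(alpha, beta) prior after one Bernoulli reward (0 or 1)."""
    return alpha + reward, beta + (1 - reward)

def stop_learning_value(alpha, beta, lam, discount):
    """'Stop learning' lower bound: commit forever to the better of the
    current posterior mean and the known alternative lam."""
    mean = alpha / (alpha + beta)
    return max(mean, lam) / (1.0 - discount)

def full_information_value(alpha, beta, lam, discount, draws=200_000, seed=0):
    """'Full information' upper bound: the true parameter p is revealed first,
    so the decision maker commits to max(p, lam); estimated by Monte Carlo
    over the current Beta posterior."""
    p = np.random.default_rng(seed).beta(alpha, beta, size=draws)
    return np.maximum(p, lam).mean() / (1.0 - discount)

if __name__ == "__main__":
    a, b = 1.0, 1.0                      # uniform prior on the unknown arm
    for reward in [1, 0, 1, 1, 0, 1]:    # N = 6 observed Bernoulli rewards
        a, b = update(a, b, reward)
    lo = stop_learning_value(a, b, lam=0.6, discount=0.9)
    hi = full_information_value(a, b, lam=0.6, discount=0.9)
    print(f"stop-learning value {lo:.3f} <= full-information value {hi:.3f}")
```

As N grows the posterior concentrates around the true parameter, so the gap between the two printed values shrinks, which is the behaviour the abstract's result formalizes.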
This publication has 7 references indexed in Scilit:
- Markov Decision Processes, Wiley Series in Probability and Statistics, 1994
- The Learning Component of Dynamic Allocation Indices, The Annals of Statistics, 1992
- Worth of perfect information in Bernoulli bandits, Advances in Applied Probability, 1991
- A Survey of Some Results in Stochastic Adaptive Control, SIAM Journal on Control and Optimization, 1985
- Bandit problems, published by Springer Science and Business Media LLC, 1985
- On the optimal solution of the one-armed bandit adaptive control problem, IEEE Transactions on Automatic Control, 1981
- Bayesian dynamic programming, Advances in Applied Probability, 1975