Optimal Policy for Bernoulli Bandits: Computation and Algorithm Gauge

Abstract
Bernoulli multi-armed bandits are a reinforcement learning model used to study a variety of choice-optimization problems. Such optimizations often concern a finite time horizon. In principle, statistically optimal policies can be computed via dynamic programming, but doing so is widely considered infeasible because of prohibitive computational requirements and implementation complexity. Suboptimal algorithms are therefore applied in practice, despite their unknown degree of suboptimality. In this article, we demonstrate that optimal policies can be computed efficiently for large time horizons or numbers of arms thanks to a novel memory organization and indexing scheme. We use the optimal policies to gauge the suboptimality of several well-known finite- and infinite-time-horizon algorithms, including the Whittle and Gittins indices, epsilon-greedy, Thompson sampling, and upper-confidence-bound (UCB) algorithms. Our simulation study shows that all but one of the evaluated algorithms perform significantly worse than the optimal policy. The Whittle index offers a nearly optimal strategy for multi-armed Bernoulli bandits, even though up to 10% of its decisions differ from those of the optimal policy table. Lastly, we discuss optimizations of known algorithms and derive a novel solution from UCB1-tuned that outperforms other infinite-time-horizon algorithms when dealing with many arms.

Impact Statement
Bernoulli bandits are a reinforcement learning model used to improve decisions with binary outcomes, with applications ranging from headline news selection to clinical trials. Existing bandit algorithms are suboptimal. This article provides the first practical computation method that determines the optimal decisions in Bernoulli bandits, yielding the lowest achievable decision regret (equivalently, the maximum expected benefit). In clinical trials, where an algorithm selects treatments for subsequent patients, our method can reduce the number of unsuccessfully treated patients by up to 5×. The optimal strategy also enables new, comprehensive evaluations of well-known suboptimal algorithms, which can significantly improve decision effectiveness in various applications.
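
To illustrate the central claim that finite-horizon optimal policies follow from dynamic programming, the Python sketch below performs plain backward induction for a two-armed Bayesian Bernoulli bandit. The uniform Beta(1,1) priors, the small horizon T, and the memoized-recursion state encoding are illustrative assumptions only; they do not reproduce the memory organization and indexing scheme introduced in the article, which is what makes much larger horizons and arm counts tractable.

  # Minimal sketch (assumptions noted above, not the authors' scheme):
  # backward-induction dynamic programming for a 2-armed Bayesian Bernoulli
  # bandit with uniform Beta(1,1) priors and finite horizon T. A state is the
  # observed (successes, failures) per arm; the optimal action maximizes the
  # expected number of remaining successes.
  from functools import lru_cache

  T = 20  # horizon; the article targets far larger horizons

  @lru_cache(maxsize=None)
  def value(s1, f1, s2, f2):
      pulls_made = s1 + f1 + s2 + f2
      if pulls_made == T:
          return 0.0
      best = 0.0
      for arm, (s, f) in enumerate(((s1, f1), (s2, f2))):
          p = (s + 1) / (s + f + 2)  # posterior mean success probability
          if arm == 0:
              q = p * (1 + value(s1 + 1, f1, s2, f2)) + (1 - p) * value(s1, f1 + 1, s2, f2)
          else:
              q = p * (1 + value(s1, f1, s2 + 1, f2)) + (1 - p) * value(s1, f1, s2, f2 + 1)
          best = max(best, q)
      return best

  print("Optimal expected successes over", T, "pulls:", round(value(0, 0, 0, 0), 4))

The exponential growth of this naive state space with the horizon and the number of arms is precisely the computational obstacle that the article's memory organization and indexing scheme is designed to overcome.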
Funding Information
  • Versyn Inc.
  • NSERC Discovery Grant (RGPIN-2016-04573)