Optimal Policy for Bernoulli Bandits: Computation and Algorithm Gauge

Abstract
Bernoulli multi-armed bandits are a reinforcement learning model used to study a variety of choice-optimization problems. Such optimizations often concern a finite time horizon. In principle, statistically optimal policies can be computed via dynamic programming, but doing so is widely considered infeasible because of prohibitive computational requirements and implementation complexity. Suboptimal algorithms are therefore applied in practice, despite their unknown degree of suboptimality. In this article, we demonstrate that optimal policies can be computed efficiently for large time horizons or numbers of arms thanks to a novel memory organization and indexing scheme. We use the optimal policies to gauge the suboptimality of several well-known finite- and infinite-time-horizon algorithms, including the Whittle and Gittins indices, epsilon-greedy, Thompson sampling, and upper-confidence-bound (UCB) algorithms. Our simulation study shows that all but one of the evaluated algorithms perform significantly worse than the optimal policy. The Whittle index offers a nearly optimal strategy for multi-armed Bernoulli bandits, even though up to 10% of its decisions differ from those of the optimal policy table. Lastly, we discuss optimizations of known algorithms and derive a novel solution from UCB1-tuned that outperforms other infinite-time-horizon algorithms when dealing with many arms.

Impact Statement
Bernoulli bandits are a reinforcement learning model used to improve decisions with binary outcomes, with applications ranging from headline news selection to clinical trials. Existing bandit algorithms are suboptimal. This article provides the first practical computation method that determines the optimal decisions in Bernoulli bandits, yielding the lowest achievable decision regret (equivalently, the maximum expected benefit). In clinical trials, where an algorithm selects treatments for subsequent patients, our method can reduce the number of unsuccessfully treated patients by up to 5×. The optimal strategy also enables new, comprehensive evaluations of well-known suboptimal algorithms, which can significantly improve decision effectiveness in various applications.
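
To illustrate the central claim that finite-horizon optimal policies follow from dynamic programming, the Python sketch below performs plain backward induction for a two-armed Bayesian Bernoulli bandit. The uniform Beta(1,1) priors, the small horizon T, and the memoized-recursion state encoding are illustrative assumptions only; they do not reproduce the memory organization and indexing scheme introduced in the article, which is what makes much larger horizons and arm counts tractable.

  # Minimal sketch (assumptions noted above, not the authors' scheme):
  # backward-induction dynamic programming for a 2-armed Bayesian Bernoulli
  # bandit with uniform Beta(1,1) priors and finite horizon T. A state is the
  # observed (successes, failures) per arm; the optimal action maximizes the
  # expected number of remaining successes.
  from functools import lru_cache

  T = 20  # horizon; the article targets far larger horizons

  @lru_cache(maxsize=None)
  def value(s1, f1, s2, f2):
      pulls_made = s1 + f1 + s2 + f2
      if pulls_made == T:
          return 0.0
      best = 0.0
      for arm, (s, f) in enumerate(((s1, f1), (s2, f2))):
          p = (s + 1) / (s + f + 2)  # posterior mean success probability
          if arm == 0:
              q = p * (1 + value(s1 + 1, f1, s2, f2)) + (1 - p) * value(s1, f1 + 1, s2, f2)
          else:
              q = p * (1 + value(s1, f1, s2 + 1, f2)) + (1 - p) * value(s1, f1, s2, f2 + 1)
          best = max(best, q)
      return best

  print("Optimal expected successes over", T, "pulls:", round(value(0, 0, 0, 0), 4))

The exponential growth of this naive state space with the horizon and the number of arms is precisely the computational obstacle that the article's memory organization and indexing scheme is designed to overcome.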
Funding Information
  • Versyn Inc.
  • NSERC Discovery Grant (RGPIN-2016-04573)