Abstract
Online learning under delayed feedback has recently been gaining increasing attention. Learning with delays is the more natural setting in most practical applications, since feedback from the environment is rarely immediate. For example, the response to a drug in a clinical trial may take some time to observe. In this paper, we study the multi-armed bandit problem with Bernoulli-distributed rewards in an environment with delays by evaluating the Explore-First algorithm. We obtain upper bounds for the algorithm, and we apply the theoretical results to develop a software framework for conducting numerical experiments.
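As a rough illustration of the setting described above, the following is a minimal Python sketch of an Explore-First strategy on a Bernoulli bandit with delayed feedback. It is not the paper's framework: the function name `explore_first_delayed`, its parameters, the fixed (constant) delay model, and the choice to act greedily on partial estimates while exploration feedback is still in transit are all assumptions made for this example.

```python
import random
from collections import deque


def explore_first_delayed(means, horizon, pulls_per_arm, delay, seed=0):
    """Hypothetical sketch: Explore-First on a Bernoulli bandit with a fixed delay.

    means         -- true success probability of each arm (unknown to the learner)
    horizon       -- total number of rounds
    pulls_per_arm -- exploration pulls per arm (round-robin)
    delay         -- assumed constant delay: the reward of round t becomes
                     visible to the learner from round t + delay + 1 onward
    """
    rng = random.Random(seed)
    k = len(means)
    explore_rounds = k * pulls_per_arm

    sums = [0.0] * k    # observed reward sums (exploration feedback only)
    counts = [0] * k    # number of observed exploration rewards per arm
    pending = deque()   # (arrival_round, arm, reward); FIFO since delay is fixed
    pulls = [0] * k
    total_reward = 0

    for t in range(horizon):
        # Deliver all feedback whose delay has elapsed.
        while pending and pending[0][0] <= t:
            _, arm, reward = pending.popleft()
            sums[arm] += reward
            counts[arm] += 1

        if t < explore_rounds:
            # Exploration phase: pull arms in round-robin order.
            arm = t % k
        else:
            # Exploitation phase: best empirical mean among feedback seen so far
            # (assumption: arms with no feedback yet are scored as 0).
            arm = max(range(k),
                      key=lambda a: sums[a] / counts[a] if counts[a] else 0.0)

        reward = 1 if rng.random() < means[arm] else 0
        total_reward += reward
        pulls[arm] += 1

        # Only exploration rewards feed the estimates, as in plain Explore-First.
        if t < explore_rounds:
            pending.append((t + delay + 1, arm, reward))

    # Pseudo-regret with respect to always playing the best arm.
    regret = max(means) * horizon - sum(m * n for m, n in zip(means, pulls))
    return pulls, total_reward, regret


if __name__ == "__main__":
    pulls, total, regret = explore_first_delayed(
        means=[0.3, 0.5, 0.7], horizon=10_000, pulls_per_arm=100, delay=50
    )
    print("pulls per arm:", pulls)
    print("total reward:", total, "pseudo-regret:", round(regret, 1))
```

Varying `delay` and `pulls_per_arm` in such a simulation is one simple way to see how the exploration budget and the delay interact; other delay models (for example, random delays) would require replacing the FIFO queue with a priority queue.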
