Paper Title

Stochastic Rising Bandits

Paper Authors

Metelli, Alberto Maria, Trovò, Francesco, Pirola, Matteo, Restelli, Marcello

Paper Abstract

This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. arm). We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing. This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds. We design an algorithm for the rested case (R-ed-UCB) and one for the restless case (R-less-UCB), providing a regret bound depending on the properties of the instance and, under certain circumstances, of $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$. We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset. Finally, using synthetic and real-world data, we illustrate the effectiveness of the proposed approaches compared with state-of-the-art algorithms for the non-stationary bandits.
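To make the setting concrete, the following is a minimal sketch of a *rested rising bandit* instance: each arm's expected payoff grows monotonically with that arm's own pull count. As a point of comparison, a vanilla UCB1 learner is run on it. This is an illustrative assumption-laden toy, not the paper's R-ed-UCB algorithm; the growth curves, plateau values, and noise level are invented for the example.

```python
import math
import random

def rested_rising_reward(arm, pulls, rng):
    """Noisy payoff for a rested rising arm: the mean is monotonically
    non-decreasing in the arm's own pull count. The concave curves below
    (plateaus and growth rates) are illustrative, not from the paper."""
    plateaus = [0.9, 0.6, 0.5]
    rates = [0.05, 0.3, 0.4]
    mean = plateaus[arm] * (1.0 - math.exp(-rates[arm] * pulls))
    return min(1.0, max(0.0, mean + rng.gauss(0.0, 0.05)))

def run_ucb1(horizon, n_arms=3, seed=0):
    """Vanilla UCB1 baseline (NOT R-ed-UCB) on the rising instance above.
    Returns per-arm pull counts and the total collected reward."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once to initialize
        else:
            # standard UCB1 index: empirical mean + exploration bonus
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = rested_rising_reward(arm, counts[arm], rng)
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return counts, total_reward

counts, total = run_ucb1(2000)
print(counts, round(total, 1))
```

Note that plain UCB1 averages all past samples, so on a rising instance it systematically underestimates slow-growing arms whose payoff will eventually dominate (arm 0 here plateaus highest but grows slowest); this is the kind of behavior that motivates algorithms tailored to the monotone structure, such as the paper's R-ed-UCB and R-less-UCB.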
