Paper Title

Stochastic Rising Bandits

Paper Authors

Metelli, Alberto Maria, Trovò, Francesco, Pirola, Matteo, Restelli, Marcello

Paper Abstract

This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. arm). We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing. This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds. We design an algorithm for the rested case (R-ed-UCB) and one for the restless case (R-less-UCB), providing a regret bound depending on the properties of the instance and, under certain circumstances, of $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$. We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset. Finally, using synthetic and real-world data, we illustrate the effectiveness of the proposed approaches compared with state-of-the-art algorithms for the non-stationary bandits.
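To make the setting concrete, the following is a minimal sketch of a *rested rising bandit* instance: each arm's expected payoff grows monotonically with that arm's own pull count. As a point of comparison, a vanilla UCB1 learner is run on it. This is an illustrative assumption-laden toy, not the paper's R-ed-UCB algorithm; the growth curves, plateau values, and noise level are invented for the example.

```python
import math
import random

def rested_rising_reward(arm, pulls, rng):
    """Noisy payoff for a rested rising arm: the mean is monotonically
    non-decreasing in the arm's own pull count. The concave curves below
    (plateaus and growth rates) are illustrative, not from the paper."""
    plateaus = [0.9, 0.6, 0.5]
    rates = [0.05, 0.3, 0.4]
    mean = plateaus[arm] * (1.0 - math.exp(-rates[arm] * pulls))
    return min(1.0, max(0.0, mean + rng.gauss(0.0, 0.05)))

def run_ucb1(horizon, n_arms=3, seed=0):
    """Vanilla UCB1 baseline (NOT R-ed-UCB) on the rising instance above.
    Returns per-arm pull counts and the total collected reward."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull each arm once to initialize
        else:
            # standard UCB1 index: empirical mean + exploration bonus
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = rested_rising_reward(arm, counts[arm], rng)
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return counts, total_reward

counts, total = run_ucb1(2000)
print(counts, round(total, 1))
```

Note that plain UCB1 averages all past samples, so on a rising instance it systematically underestimates slow-growing arms whose payoff will eventually dominate (arm 0 here plateaus highest but grows slowest); this is the kind of behavior that motivates algorithms tailored to the monotone structure, such as the paper's R-ed-UCB and R-less-UCB.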
