Paper Title

A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits

Paper Authors

Yasin Abbasi-Yadkori, Andras Gyorgy, Nevena Lazic

Paper Abstract

We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of its dynamic regret, which is defined as the difference between the expected cumulative reward of an agent choosing the optimal arm in every time step and the cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to consider how many times the identity of the optimal arm can change. We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde O(\sqrt{K N(S+1)})$ dynamic regret, where $N$ is the time horizon of the problem and $S$ is the number of times the identity of the optimal arm changes, without prior knowledge of $S$. Previous works for this problem obtain regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger, or assume prior knowledge of $S$ to achieve similar bounds.
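
For concreteness, the dynamic regret described in the abstract can be written out as follows; the notation here ($\mu_t(a)$ for the mean reward of arm $a$ at time step $t$, and $A_t$ for the arm pulled by the algorithm at time $t$) is our own shorthand and is not taken from the paper:

$$
\mathrm{Regret}(N) \;=\; \sum_{t=1}^{N} \max_{a \in \{1,\dots,K\}} \mu_t(a) \;-\; \mathbb{E}\!\left[\sum_{t=1}^{N} \mu_t(A_t)\right].
$$

In this notation, the guarantee stated above is $\mathrm{Regret}(N) = \widetilde O(\sqrt{K N (S+1)})$, where $S$ counts how many times $\arg\max_{a} \mu_t(a)$ changes over $t = 1, \dots, N$, and the algorithm requires no prior knowledge of $S$.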
