Paper Title
Strategies for Safe Multi-Armed Bandits with Logarithmic Regret and Risk
Paper Authors
Paper Abstract
We investigate a natural but surprisingly unstudied approach to the multi-armed bandit problem under safety risk constraints. Each arm is associated with an unknown law on safety risks and rewards, and the learner's goal is to maximise reward whilst not playing unsafe arms, as determined by a given threshold on the mean risk. We formulate a pseudo-regret for this setting that enforces this safety constraint in a per-round way by softly penalising any violation, regardless of the gain in reward due to the same. This has practical relevance to scenarios such as clinical trials, where one must maintain safety for each round rather than in an aggregated sense. We describe doubly optimistic strategies for this scenario, which maintain optimistic indices for both safety risk and reward. We show that schema based on both frequentist and Bayesian indices satisfy tight gap-dependent logarithmic regret bounds, and further that these play unsafe arms only logarithmically many times in total. This theoretical analysis is complemented by simulation studies demonstrating the effectiveness of the proposed schema, and probing the domains in which their use is appropriate.
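The doubly optimistic strategy described in the abstract can be sketched in code. The following is a minimal illustration, not the paper's actual algorithm: it assumes Hoeffding-style confidence bonuses (an upper confidence bound on reward and a matching lower confidence bound on risk), a fixed mean-risk threshold, and a simple fallback when no arm currently looks safe. The function name, samplers, and constants are all hypothetical.

```python
import math

def doubly_optimistic_ucb(arms, threshold, horizon):
    """Sketch of a doubly optimistic bandit strategy.

    `arms` is a list of (reward_sampler, risk_sampler) pairs; each sampler
    is a zero-argument callable returning a value in [0, 1]. An arm is
    optimistically deemed safe if the lower confidence bound on its mean
    risk is at most `threshold`.
    """
    k = len(arms)
    counts = [0] * k            # plays per arm
    reward_sums = [0.0] * k     # cumulative observed rewards per arm
    risk_sums = [0.0] * k       # cumulative observed risks per arm
    total_reward = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Play each arm once to initialise the indices.
            choice = t - 1
        else:
            # Hoeffding-style bonus; optimism in both directions:
            # UCB on reward, LCB on risk.
            bonus = [math.sqrt(2.0 * math.log(t) / counts[i]) for i in range(k)]
            reward_ucb = [reward_sums[i] / counts[i] + bonus[i] for i in range(k)]
            risk_lcb = [risk_sums[i] / counts[i] - bonus[i] for i in range(k)]

            # Among arms optimistically deemed safe, pick the highest
            # reward UCB; if none look safe, fall back to pure optimism.
            feasible = [i for i in range(k) if risk_lcb[i] <= threshold]
            pool = feasible if feasible else range(k)
            choice = max(pool, key=lambda i: reward_ucb[i])

        reward_fn, risk_fn = arms[choice]
        r = reward_fn()
        counts[choice] += 1
        reward_sums[choice] += r
        risk_sums[choice] += risk_fn()
        total_reward += r

    return counts, total_reward
```

On a toy two-arm instance where the higher-reward arm has mean risk above the threshold, the play counts illustrate the abstract's claim qualitatively: the unsafe arm is only played while its risk LCB still dips below the threshold, which happens O(log T) times, while the safe arm receives the bulk of the plays.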