Paper Title
Regret Bound Balancing and Elimination for Model Selection in Bandits and RL
Paper Authors
Paper Abstract
We propose a simple model selection approach for algorithms in stochastic bandit and reinforcement learning problems. As opposed to prior work that (implicitly) assumes knowledge of the optimal regret, we only require that each base algorithm comes with a candidate regret bound that may or may not hold during all rounds. In each round, our approach plays a base algorithm to keep the candidate regret bounds of all remaining base algorithms balanced, and eliminates algorithms that violate their candidate bound. We prove that the total regret of this approach is bounded by the best valid candidate regret bound times a multiplicative factor. This factor is reasonably small in several applications, including linear bandits and MDPs with nested function classes, linear bandits with unknown misspecification, and LinUCB applied to linear bandits with different confidence parameters. We further show that, under a suitable gap assumption, this factor only scales with the number of base algorithms and not their complexity when the number of rounds is large enough. Finally, unlike recent efforts in model selection for linear stochastic bandits, our approach is versatile enough to also cover cases where the context information is generated by an adversarial environment, rather than a stochastic one.
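To make the balancing-and-elimination idea concrete, the following is a minimal Python sketch, not the paper's exact procedure. It assumes a hypothetical `play_round(i)` hook that runs one round of base algorithm `i` against the environment and returns the observed reward, and its elimination test is a simplified stand-in for the paper's statistical test on candidate regret bounds.

```python
import math

def regret_balancing(candidate_bounds, play_round, horizon):
    """Sketch of regret-bound balancing with elimination (illustrative only).

    candidate_bounds[i](n): putative regret bound of base algorithm i after n plays
                            (these bounds may or may not actually hold).
    play_round(i)         : hypothetical hook; plays one round with base algorithm i
                            and returns the observed reward.
    horizon               : total number of rounds.
    """
    k = len(candidate_bounds)
    active = list(range(k))
    plays = [0] * k          # number of rounds each base algorithm has been played
    reward_sum = [0.0] * k   # total reward collected by each base algorithm

    for t in range(horizon):
        # Balancing step: play the active algorithm whose candidate regret bound
        # (at its current play count) is smallest, which keeps the candidate
        # bounds of all remaining algorithms roughly equal.
        i = min(active, key=lambda j: candidate_bounds[j](plays[j]))
        reward_sum[i] += play_round(i)
        plays[i] += 1

        # Elimination step (simplified stand-in): flag algorithm i if its average
        # reward plus its per-round candidate regret, padded by a crude confidence
        # slack, still falls below the best average reward among active algorithms.
        avg = {j: reward_sum[j] / max(plays[j], 1) for j in active}
        slack = math.sqrt(math.log(max(horizon, 2)) / plays[i])
        if avg[i] + candidate_bounds[i](plays[i]) / plays[i] + slack < max(avg.values()):
            active.remove(i)
            if not active:       # always keep at least one algorithm active
                active.append(i)

    return active  # base algorithms whose candidate bounds were never flagged
```

As a usage example, one could pass `candidate_bounds = [lambda n, c=c: c * math.sqrt(n + 1) for c in (1.0, 5.0, 25.0)]` for three base algorithms with nested complexity, and a `play_round` wrapper around the actual bandit interaction; the returned list indicates which candidate bounds survived elimination.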