Paper Title

Multi-Armed Bandits with Dependent Arms

Authors

Rahul Singh, Fang Liu, Yin Sun, Ness Shroff

Abstract

We study a variant of the classical multi-armed bandit problem (MABP) which we call Multi-Armed Bandits with Dependent Arms. More specifically, multiple arms are grouped together to form a cluster, and the reward distributions of arms belonging to the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm $i$ not only reveals information about its own reward distribution, but also about all those arms that share the same cluster with arm $i$. This "correlation" amongst the arms complicates the exploration-exploitation trade-off encountered in the MABP, because the observation dependencies allow us to simultaneously test multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the UCB principle which appropriately utilize these additional side observations while performing the exploration-exploitation trade-off. We show that the regret of our algorithms grows as $O(K\log T)$, where $K$ is the number of clusters. In contrast, for an algorithm such as vanilla UCB, which is optimal for the classical MABP but does not utilize these dependencies, the regret scales as $O(M\log T)$, where $M$ is the number of arms.
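To make the clustering idea concrete, the following is a minimal toy sketch (not the authors' algorithm) of a UCB-style strategy for one simple instance of the model: each arm's mean reward is the unknown cluster parameter plus a known arm-specific offset, so the offset plays the role of the "known function" of the cluster parameter. The names (`theta`, `offset`, `cluster_of`) and the confidence bonus are illustrative assumptions. Every pull of any arm in a cluster is converted into one sample of that cluster's parameter, so confidence intervals tighten at the cluster level rather than the arm level; this is the intuition behind the $O(K\log T)$ versus $O(M\log T)$ distinction.

```python
import numpy as np

# Hypothetical toy instance: within cluster c, arm i's mean reward is
# theta[c] + offset[i], a simple case of "known functions of an unknown
# cluster parameter". All names and constants here are illustrative.
rng = np.random.default_rng(0)

M, K = 8, 2                                  # number of arms, number of clusters
cluster_of = np.array([0, 0, 0, 0, 1, 1, 1, 1])
offset     = rng.uniform(-0.2, 0.2, size=M)  # known arm-specific shifts
theta      = np.array([0.3, 0.6])            # unknown cluster parameters
T = 10_000

# Cluster-level statistics: every pull of any arm in a cluster
# contributes one (reward - offset) sample of that cluster's theta.
n_c   = np.zeros(K)       # pulls per cluster
sum_c = np.zeros(K)       # running sum of theta-samples per cluster
regret = 0.0
best_mean = (theta[cluster_of] + offset).max()

for t in range(1, T + 1):
    if t <= M:                               # pull each arm once to initialize
        arm = t - 1
    else:
        theta_hat = sum_c / n_c              # cluster-parameter estimates
        bonus = np.sqrt(2.0 * np.log(t) / n_c)
        ucb = theta_hat[cluster_of] + offset + bonus[cluster_of]
        arm = int(np.argmax(ucb))
    c = cluster_of[arm]
    mean = theta[c] + offset[arm]
    reward = mean + rng.normal(0.0, 0.1)     # noisy observed reward
    n_c[c]   += 1
    sum_c[c] += reward - offset[arm]         # map the reward back to a theta sample
    regret   += best_mean - mean

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Because the exploration bonus depends on the cluster pull count `n_c` rather than per-arm counts, the index of every arm in a cluster shrinks whenever any arm of that cluster is pulled, which is how side observations are exploited in this sketch.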
