Paper Title

Selectively Contextual Bandits

Paper Authors

Roberts, Claudia, Dimakopoulou, Maria, Qiao, Qifeng, Chandrashekhar, Ashok, Jebara, Tony

Paper Abstract

Contextual bandits are widely used in industrial personalization systems. These online learning frameworks learn a treatment assignment policy in the presence of treatment effects that vary with the observed contextual features of the users. While personalization creates a rich user experience that reflects individual interests, there are benefits of a shared experience across a community that enable participation in the zeitgeist. Such benefits are emergent through network effects and are not captured in regret metrics typically employed in evaluating bandits. To balance these needs, we propose a new online learning algorithm that preserves the benefits of personalization while increasing the commonality in treatments across users. Our approach selectively interpolates between a contextual bandit algorithm and a context-free multi-armed bandit, leveraging the contextual information for a treatment decision only if it promises significant gains. Apart from helping users of personalization systems balance their experience between the individualized and the shared, simplifying the treatment assignment policy by making it selectively reliant on the context can help improve the rate of learning in some cases. We evaluate our approach in a classification setting using public datasets and show the benefits of the hybrid policy.
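The selective interpolation described in the abstract can be sketched as a per-decision gate: choose the contextual arm only when its predicted advantage over the context-free choice clears a threshold. The sketch below is illustrative only, assuming a hypothetical linear contextual model, running-mean context-free estimates, and a made-up `GAIN_THRESHOLD` cutoff; the paper's actual estimators and gain criterion may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_features = 3, 5

# Hypothetical linear contextual model: one weight vector per arm
# (stands in for whatever contextual bandit estimator is used).
theta = rng.normal(size=(n_arms, n_features))

# Context-free estimates: mean observed reward per arm
# (stands in for a multi-armed bandit's value estimates).
mean_reward = np.zeros(n_arms)

GAIN_THRESHOLD = 0.5  # hypothetical cutoff for "significant gains"

def select_arm(context):
    """Use the contextual policy only when its predicted advantage
    over the context-free policy exceeds GAIN_THRESHOLD."""
    contextual_values = theta @ context               # per-arm predicted reward
    contextual_arm = int(np.argmax(contextual_values))
    context_free_arm = int(np.argmax(mean_reward))
    # Predicted gain from personalizing this particular decision.
    gain = contextual_values[contextual_arm] - contextual_values[context_free_arm]
    return contextual_arm if gain > GAIN_THRESHOLD else context_free_arm

arm = select_arm(rng.normal(size=n_features))
print(arm)  # an arm index in {0, 1, 2}
```

When the gate falls back to the context-free arm, treatments concentrate on arms shared across users, which is how the hybrid policy trades a little personalization for commonality.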
