Paper Title

Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts

Paper Authors

Giulia Romano, Andrea Agostini, Francesco Trovò, Nicola Gatti, Marcello Restelli

Paper Abstract

There is a rising interest in industrial online applications where data becomes available sequentially. Inspired by the recommendation of playlists to users where their preferences can be collected during the listening of the entire playlist, we study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB), in which the stochastic reward associated with the pull of an arm is partitioned over a finite number of consecutive rounds following the pull. This setting, unexplored so far to the best of our knowledge, is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull instead of being fully disclosed in a single, potentially delayed round. We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW, which exploit the partial information disclosed by the reward collected over time. We show that our algorithms provide better asymptotic regret upper bounds than delayed-feedback bandit algorithms when a property characterizing a broad set of reward structures of practical interest, namely alpha-smoothness, holds. We also empirically evaluate their performance across a wide range of settings, both synthetically generated and from a real-world media recommendation problem.
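To make the TP-MAB setting concrete, below is a minimal Python sketch (not taken from the paper) of a toy environment in which each pull's Bernoulli reward is spread in equal shares over the next `tau` rounds, so the learner observes partial feedback before the pull's cumulative reward is fully disclosed. The arm means, horizon, `tau`, uniform partitioning scheme, and the random pulling policy are all illustrative assumptions; the paper's TP-UCB-FR and TP-UCB-EW algorithms and the alpha-smoothness property are not reproduced here.

```python
import numpy as np

# Illustrative sketch of a temporally-partitioned reward environment:
# each pull's reward is delivered in equal shares over the next `tau` rounds.
# All numbers below are assumptions chosen only for demonstration.

rng = np.random.default_rng(0)

n_arms = 3
tau = 5                      # rounds over which a pull's reward is spread (assumed)
arm_means = [0.2, 0.5, 0.8]  # hypothetical per-pull expected rewards
horizon = 20

pending = []  # (arm index, list of per-round reward shares still to be delivered)

for t in range(horizon):
    arm = int(rng.integers(n_arms))              # placeholder policy: pull uniformly at random
    total_reward = rng.binomial(1, arm_means[arm])
    # Partition the pull's reward into tau equal shares over the next tau rounds.
    pending.append((arm, [total_reward / tau] * tau))

    # Collect the partial feedback arriving this round from all past pulls.
    observed = 0.0
    for _, shares in pending:
        observed += shares.pop(0)
    pending = [(a, s) for a, s in pending if s]   # drop pulls whose reward is fully disclosed

    print(f"round {t}: pulled arm {arm}, partial reward observed = {observed:.2f}")
```

A TP-MAB learner would replace the random policy above with index-based arm selection that updates its estimates from these per-round partial observations rather than waiting for the full reward of each pull.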
