Paper Title

Scalable Representation Learning in Linear Contextual Bandits with Constant Regret Guarantees

Paper Authors

Andrea Tirinzoni, Matteo Papini, Ahmed Touati, Alessandro Lazaric, Matteo Pirotta

Paper Abstract

We study the problem of representation learning in stochastic contextual linear bandits. While the primary concern in this domain is usually to find realizable representations (i.e., those that allow predicting the reward function exactly at any context-action pair), it has recently been shown that representations with certain spectral properties (called HLS) may be more effective for the exploration-exploitation task, enabling LinUCB to achieve constant (i.e., horizon-independent) regret. In this paper, we propose BanditSRL, a representation learning algorithm that combines a novel constrained optimization problem, which learns a realizable representation with good spectral properties, with a generalized likelihood ratio test that exploits the recovered representation and avoids excessive exploration. We prove that BanditSRL can be paired with any no-regret algorithm and achieves constant regret whenever an HLS representation is available. Furthermore, BanditSRL can easily be combined with deep neural networks, and we show how regularizing towards HLS representations is beneficial in standard benchmarks.
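
The abstract's final claim, that regularizing towards HLS representations helps when features are learned by a neural network, can be illustrated with a short sketch. The HLS condition asks that the features of the optimal actions span the whole space, i.e., that the minimum eigenvalue of their second-moment matrix be positive. Below is a minimal, hypothetical PyTorch sketch of such a regularizer added to the usual reward-regression (realizability) loss; this is not the authors' implementation, and the names `phi_opt`, `hls_regularizer`, and the weight `lam` are illustrative assumptions.

```python
# Hedged sketch (not the paper's code): an HLS-style regularizer that pushes up
# the minimum eigenvalue of the second-moment matrix of optimal-action features.
import torch
import torch.nn.functional as F

def hls_regularizer(phi_opt: torch.Tensor) -> torch.Tensor:
    """phi_opt: (batch, d) features of the (estimated) optimal action per context.

    Returns -lambda_min of the empirical design matrix, so that minimizing the
    loss increases the smallest eigenvalue; the HLS condition corresponds to
    lambda_min(E[phi(x, a*) phi(x, a*)^T]) > 0.
    """
    design = phi_opt.T @ phi_opt / phi_opt.shape[0]  # (d, d) empirical moment matrix
    eigvals = torch.linalg.eigvalsh(design)          # symmetric input -> real, ascending
    return -eigvals[0]                               # negate so training maximizes it

def combined_loss(pred_rewards, rewards, phi_opt, lam: float = 0.1):
    """Realizability (regression) loss plus the HLS penalty; `lam` is a free
    hyperparameter in this sketch, not a value taken from the paper."""
    return F.mse_loss(pred_rewards, rewards) + lam * hls_regularizer(phi_opt)
```

In words: the regression term keeps the learned representation realizable (it still predicts rewards well), while the spectral term steers it towards representations under which constant regret becomes achievable.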
