Paper Title
Posterior Sampling for Continuing Environments
Paper Authors
Paper Abstract
We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
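To make the interaction loop described in the abstract concrete, here is a minimal Python sketch of a continuing-PSRL-style agent. It is an illustrative reading of the abstract, not the paper's implementation: the `env`, `prior`, and `planner` interfaces are hypothetical placeholders that are assumed to provide environment steps, posterior sampling with Bayesian updates, and discounted-return planning in a sampled model.

```python
import numpy as np

def continuing_psrl(env, prior, planner, gamma, T, seed=0):
    """Illustrative sketch of a continuing-PSRL loop (assumed interfaces).

    Assumptions (not from the paper):
      - prior.sample() draws an environment model from the posterior;
      - prior.update(s, a, r, s_next) performs the Bayesian posterior update;
      - planner(model, gamma) returns a policy maximizing expected
        gamma-discounted return in the sampled model;
      - env.reset() returns an initial state and env.step(a) returns
        (reward, next_state) under the continuing interface.
    """
    rng = np.random.default_rng(seed)
    model = prior.sample()            # statistically plausible model of the environment
    policy = planner(model, gamma)    # plan against the sampled model
    state = env.reset()
    for t in range(T):
        action = policy(state)
        reward, next_state = env.step(action)
        prior.update(state, action, reward, next_state)
        # With probability 1 - gamma, replace the model by a fresh posterior sample
        # and re-plan; otherwise keep acting under the current sampled model.
        if rng.random() < 1.0 - gamma:
            model = prior.sample()
            policy = planner(model, gamma)
        state = next_state
```

The geometric resampling schedule (expected interval of $1/(1-\gamma)$ steps between samples) is what ties the discount factor to the horizon $T$ in the regret bound stated above.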