Paper Title
An Adiabatic Theorem for Policy Tracking with TD-learning
Paper Authors
Paper Abstract
We evaluate the ability of temporal difference learning to track the reward function of a policy as it changes over time. Our results apply a new adiabatic theorem that bounds the mixing time of time-inhomogeneous Markov chains. We derive finite-time bounds for tabular temporal difference learning and $Q$-learning when the policy used for training changes in time. To achieve this, we develop bounds for stochastic approximation under asynchronous adiabatic updates.
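To make the setting concrete, below is a minimal illustrative sketch, not the paper's algorithm: tabular TD(0) tracking the value function of a policy whose induced Markov chain drifts slowly over time, the "adiabatic" regime the abstract refers to. The two-state chain, the interpolation schedule `policy_transition`, the reward vector, and the step size are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 2, 0.9, 0.1
V = np.zeros(n_states)  # tabular value estimates

def policy_transition(t, T=10_000):
    """Transition matrix of the Markov chain induced by the current policy.

    The policy drifts slowly with t (adiabatic change), interpolating
    between two fixed transition matrices P0 and P1. This schedule is an
    assumption of the sketch, not taken from the paper.
    """
    P0 = np.array([[0.9, 0.1], [0.1, 0.9]])
    P1 = np.array([[0.5, 0.5], [0.5, 0.5]])
    w = min(t / T, 1.0)  # slow interpolation parameter
    return (1 - w) * P0 + w * P1

reward = np.array([1.0, 0.0])  # illustrative state-dependent rewards

s = 0
for t in range(10_000):
    P = policy_transition(t)
    s_next = rng.choice(n_states, p=P[s])
    # Asynchronous TD(0) update: only the visited state's estimate changes,
    # matching the asynchronous adiabatic updates mentioned in the abstract.
    td_error = reward[s] + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    s = s_next

print(V)  # estimates that attempt to track the drifting policy's values
```

Because the policy changes slowly relative to the learning dynamics, the TD iterates can stay close to the moving target value function; the paper's finite-time bounds quantify this tracking error in terms of the drift rate and the mixing times of the time-inhomogeneous chain.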