Paper Title


A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms

Authors

Shangtong Zhang, Romain Laroche, Harm van Seijen, Shimon Whiteson, Remi Tachet des Combes

Abstract


We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms usually have discounting for both actor and critic, i.e., there is a $\gamma^t$ term in the actor update for the transition observed at time $t$ in a trajectory and the critic is a discounted value function. Practitioners, however, usually ignore the discounting ($\gamma^t$) for the actor while using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective ($\gamma = 1$) where $\gamma^t$ disappears naturally ($1^t = 1$). We then propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective ($\gamma < 1$) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.
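To make the mismatch concrete, here is a minimal sketch (not from the paper, illustration only) of the per-timestep weight that multiplies the policy-gradient term for the transition at time $t$: the theoretically derived actor update weights it by $\gamma^t$, while common implementations drop this factor and weight every transition equally.

```python
def actor_gradient_weights(gamma, horizon, discounted_actor):
    """Per-timestep weights on the actor's policy-gradient term.

    Hypothetical helper for illustration. The theoretical actor update
    weights the transition at time t by gamma**t; the common
    implementation omits this factor (i.e., uses weight 1 everywhere).
    """
    if discounted_actor:
        return [gamma ** t for t in range(horizon)]
    return [1.0] * horizon


# With gamma = 0.99 over a horizon of 3, the theoretical update
# down-weights later transitions; the common implementation does not.
theory = actor_gradient_weights(0.99, 3, discounted_actor=True)
practice = actor_gradient_weights(0.99, 3, discounted_actor=False)
# theory   -> [1.0, 0.99, 0.9801]
# practice -> [1.0, 1.0, 1.0]
```

Note that with $\gamma = 1$ (the paper's first scenario) the two weightings coincide, since $1^t = 1$.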
