Paper Title


A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms

Authors

Shangtong Zhang, Romain Laroche, Harm van Seijen, Shimon Whiteson, Remi Tachet des Combes

Abstract


We investigate the discounting mismatch in actor-critic algorithm implementations from a representation learning perspective. Theoretically, actor-critic algorithms usually have discounting for both actor and critic, i.e., there is a $\gamma^t$ term in the actor update for the transition observed at time $t$ in a trajectory and the critic is a discounted value function. Practitioners, however, usually ignore the discounting ($\gamma^t$) for the actor while using a discounted critic. We investigate this mismatch in two scenarios. In the first scenario, we consider optimizing an undiscounted objective ($\gamma = 1$) where $\gamma^t$ disappears naturally ($1^t = 1$). We then propose to interpret the discounting in the critic in terms of a bias-variance-representation trade-off and provide supporting empirical results. In the second scenario, we consider optimizing a discounted objective ($\gamma < 1$) and propose to interpret the omission of the discounting in the actor update from an auxiliary task perspective and provide supporting empirical results.
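To make the mismatch concrete, here is a minimal sketch (not from the paper, illustration only) of the per-timestep weight that multiplies the policy-gradient term for the transition at time $t$: the theoretically derived actor update weights it by $\gamma^t$, while common implementations drop this factor and weight every transition equally.

```python
def actor_gradient_weights(gamma, horizon, discounted_actor):
    """Per-timestep weights on the actor's policy-gradient term.

    Hypothetical helper for illustration. The theoretical actor update
    weights the transition at time t by gamma**t; the common
    implementation omits this factor (i.e., uses weight 1 everywhere).
    """
    if discounted_actor:
        return [gamma ** t for t in range(horizon)]
    return [1.0] * horizon


# With gamma = 0.99 over a horizon of 3, the theoretical update
# down-weights later transitions; the common implementation does not.
theory = actor_gradient_weights(0.99, 3, discounted_actor=True)
practice = actor_gradient_weights(0.99, 3, discounted_actor=False)
# theory   -> [1.0, 0.99, 0.9801]
# practice -> [1.0, 1.0, 1.0]
```

Note that with $\gamma = 1$ (the paper's first scenario) the two weightings coincide, since $1^t = 1$.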
