Paper Title
Does DQN really learn? Exploring adversarial training schemes in Pong
Paper Authors
Paper Abstract
In this work, we study two self-play training schemes, Chainer and Pool, and show they lead to improved agent performance in Atari Pong compared to a standard DQN agent -- trained against the built-in Atari opponent. To measure agent performance, we define a robustness metric that captures how difficult it is to learn a strategy that beats the agent's learned policy. Through playing past versions of themselves, Chainer and Pool are able to target weaknesses in their policies and improve their resistance to attack. Agents trained using these methods score well on our robustness metric and can easily defeat the standard DQN agent. We conclude by using linear probing to illuminate what internal structures the different agents develop to play the game. We show that training agents with Chainer or Pool leads to richer network activations with greater predictive power to estimate critical game-state features compared to the standard DQN agent.
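To illustrate the linear probing step described above, the sketch below (not the authors' code) fits a linear regressor on frozen hidden activations to predict game-state features. The `DQNTrunk` module, the random `frames`, and the `features` targets are all placeholder assumptions; in the paper's setting the activations would come from the trained Chainer, Pool, or standard DQN agents, and the targets would be critical Pong features such as ball and paddle positions.

```python
# Minimal sketch of linear probing on frozen DQN activations (assumptions noted below).
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for a convolutional DQN trunk (standard Atari DQN layout);
# the paper's agents are trained on Pong and then frozen before probing.
class DQNTrunk(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)  # 512-d hidden activations

trunk = DQNTrunk().eval()

# Placeholder data: stacked 84x84 Pong frames and "critical game-state features"
# (e.g. ball x/y and paddle y positions). In practice these come from game rollouts.
frames = torch.rand(256, 4, 84, 84)
features = np.random.rand(256, 3)

with torch.no_grad():
    acts = trunk(frames).numpy()  # frozen activations, shape (256, 512)

# Fit a linear probe on a training split; higher held-out R^2 indicates the
# representation encodes more information about the game state.
probe = LinearRegression().fit(acts[:200], features[:200])
print("held-out R^2:", probe.score(acts[200:], features[200:]))
```

Comparing held-out probe scores across agents trained with Chainer, Pool, or the standard scheme is one way to quantify the claim that self-play leads to richer, more predictive internal representations.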