Paper Title

Deep Inverse Q-learning with Constraints

Paper Authors

Gabriel Kalweit, Maria Huegle, Moritz Werling, Joschka Boedecker

Paper Abstract

Popular Maximum Entropy Inverse Reinforcement Learning approaches require the computation of expected state visitation frequencies for the optimal policy under an estimate of the reward function. This usually requires intermediate value estimation in the inner loop of the algorithm, slowing down convergence considerably. In this work, we introduce a novel class of algorithms that only needs to solve the MDP underlying the demonstrated behavior once to recover the expert policy. This is possible through a formulation that exploits a probabilistic behavior assumption for the demonstrations within the structure of Q-learning. We propose Inverse Action-value Iteration which is able to fully recover an underlying reward of an external agent in closed-form analytically. We further provide an accompanying class of sampling-based variants which do not depend on a model of the environment. We show how to extend this class of algorithms to continuous state-spaces via function approximation and how to estimate a corresponding action-value function, leading to a policy as close as possible to the policy of the external agent, while optionally satisfying a list of predefined hard constraints. We evaluate the resulting algorithms, called Inverse Action-value Iteration, Inverse Q-learning, and Deep Inverse Q-learning, on the Objectworld benchmark, showing a speedup of up to several orders of magnitude compared to (Deep) Max-Entropy algorithms. We further apply Deep Constrained Inverse Q-learning on the task of learning autonomous lane-changes in the open-source simulator SUMO, achieving competent driving after training on data corresponding to 30 minutes of demonstrations.
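
To make the probabilistic behavior assumption mentioned in the abstract concrete: if the expert is assumed to act according to a Boltzmann distribution over its optimal action values, pi_E(a|s) ∝ exp(Q*(s,a)), then log-probability differences of the demonstrated policy equal action-value differences, Q*(s,a) - Q*(s,b) = log pi_E(a|s) - log pi_E(b|s). The following is a minimal sketch of this identity in the degenerate one-step (gamma = 0) case, where it recovers each state's rewards up to an additive constant directly from demonstration counts, with no inner-loop policy optimization. It is not the authors' full Inverse Action-value Iteration; the state/action sizes, reward values, and sample counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

# Ground-truth rewards of a toy one-step MDP (hypothetical numbers).
true_r = rng.normal(size=(n_states, n_actions))

def boltzmann(q):
    """Softmax over action values, the assumed expert behavior model."""
    z = np.exp(q - q.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Simulate demonstrations: the expert samples actions from its Boltzmann policy.
expert_pi = boltzmann(true_r)
counts = np.vstack([rng.multinomial(20_000, expert_pi[s]) for s in range(n_states)])

# Closed-form recovery in the one-step case: empirical log-probabilities give
# the rewards up to a per-state constant; anchor by subtracting the state mean.
emp_pi = counts / counts.sum(axis=1, keepdims=True)
recovered_r = np.log(emp_pi)
recovered_r -= recovered_r.mean(axis=1, keepdims=True)

centered_true = true_r - true_r.mean(axis=1, keepdims=True)
print("max abs recovery error:", np.max(np.abs(recovered_r - centered_true)))
```

According to the abstract, the paper extends this idea to discounted multi-step MDPs (Inverse Action-value Iteration and the sampling-based Inverse Q-learning), and to continuous state spaces via function approximation with optional hard constraints (Deep Constrained Inverse Q-learning).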
