Symbolic Network: Generalized Neural Policies for Relational MDPs

Paper title

Symbolic Network: Generalized Neural Policies for Relational MDPs

Authors

Sankalp Garg, Aniket Bajpai, Mausam

Abstract

A Relational Markov Decision Process (RMDP) is a first-order representation that expresses all instances of a single probabilistic planning domain with a possibly unbounded number of objects. Early work on RMDPs outputs generalized (instance-independent) first-order policies or value functions as a means to solve all instances of a domain at once. Unfortunately, this line of work met with limited success due to inherent limitations of the representation space used in such policies or value functions. Can neural models provide the missing link by easily representing more complex generalized policies, thus making them effective on all instances of a given domain? We present SymNet, the first neural approach for solving RMDPs that are expressed in the probabilistic planning language RDDL. SymNet trains a set of shared parameters for an RDDL domain using training instances from that domain. For each instance, SymNet first converts it to an instance graph and then uses relational neural models to compute node embeddings. It then scores each ground action as a function over the first-order action symbols and node embeddings related to the action. Given a new test instance from the same domain, the SymNet architecture with pre-trained parameters scores each ground action and chooses the best action. This can be accomplished in a single forward pass without any retraining on the test instance, thus implicitly representing a neural generalized policy for the whole domain. Our experiments on nine RDDL domains from IPPC demonstrate that SymNet policies are significantly better than random and sometimes even more effective than training a state-of-the-art deep reactive policy from scratch.
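
The pipeline the abstract describes (instance graph, node embeddings, scoring of ground actions with shared parameters, single forward pass) can be illustrated with a toy sketch. This is not the authors' implementation: the scalar embeddings, the mean-aggregation update, the parameter values, and every name below (`W_SELF`, `W_NBR`, `ACTION_WEIGHTS`, `embed_nodes`, `score_actions`, `choose_action`) are assumptions made purely for illustration. The only point it tries to show is that the same parameters can score actions on instances of different sizes.

```python
import math

# Hypothetical shared parameters, reused unchanged across all instances.
W_SELF = 0.5                                     # weight on a node's own value
W_NBR = 0.5                                      # weight on the neighbour mean
ACTION_WEIGHTS = {"move": 1.0, "repair": -0.5}   # one weight per action symbol


def embed_nodes(features, edges, rounds=2):
    """Toy mean-aggregation message passing over an undirected graph."""
    neighbours = {n: [] for n in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    emb = dict(features)
    for _ in range(rounds):
        new = {}
        for n, nbrs in neighbours.items():
            mean = sum(emb[m] for m in nbrs) / len(nbrs) if nbrs else 0.0
            new[n] = math.tanh(W_SELF * emb[n] + W_NBR * mean)
        emb = new
    return emb


def score_actions(emb, ground_actions):
    """Score each ground action from its symbol and its argument embeddings."""
    scores = {}
    for symbol, args in ground_actions:
        pooled = sum(emb[a] for a in args) / len(args) if args else 0.0
        scores[(symbol, args)] = ACTION_WEIGHTS[symbol] * pooled
    return scores


def choose_action(features, edges, ground_actions):
    """Single forward pass: embed nodes, score actions, return the best one."""
    emb = embed_nodes(features, edges)
    scores = score_actions(emb, ground_actions)
    return max(scores, key=scores.get)


# A small three-object instance (made-up data).
small_features = {"a": 1.0, "b": 0.0, "c": -1.0}
small_edges = [("a", "b"), ("b", "c")]
small_actions = [("move", ("a",)), ("move", ("c",)), ("repair", ("a",))]
best_small = choose_action(small_features, small_edges, small_actions)

# A larger instance of the same hypothetical domain, same parameters.
large_features = {"a": 0.2, "b": 0.9, "c": -0.3, "d": 0.4}
large_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
large_actions = [("move", (n,)) for n in large_features] + [("repair", ("b",))]
best_large = choose_action(large_features, large_edges, large_actions)
```

Because the parameters are attached to action symbols and to the update rule rather than to specific objects, nothing in the sketch has to change when the number of objects grows, which is the property the abstract calls a generalized policy.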
