Paper Title
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
Paper Authors
Abstract
Most existing works in vision-and-language navigation (VLN) focus on either discrete or continuous environments, training agents that cannot generalize across the two. The fundamental difference between the two setups is that discrete navigation assumes prior knowledge of the connectivity graph of the environment, so that the agent can effectively transform the problem of navigation with low-level controls into jumping from node to node with high-level actions, by grounding to an image of a navigable direction. To bridge the discrete-to-continuous gap, we propose a predictor that generates a set of candidate waypoints during navigation, so that agents designed with high-level actions can be transferred to and trained in continuous environments. We refine the connectivity graphs of Matterport3D to fit the continuous Habitat-Matterport3D, and train the waypoint predictor with the refined graphs to produce accessible waypoints at each time step. Moreover, we demonstrate that the predicted waypoints can be augmented during training to diversify the views and paths, thereby enhancing the agent's generalization ability. Through extensive experiments, we show that agents navigating in continuous environments with predicted waypoints perform significantly better than agents using low-level actions, reducing the absolute discrete-to-continuous gap by 11.76% Success weighted by Path Length (SPL) for the Cross-Modal Matching agent and 18.24% SPL for the Recurrent VLN-BERT. Our agents, trained with a simple imitation learning objective, outperform previous methods by a large margin, achieving new state-of-the-art results on the test environments of the R2R-CE and RxR-CE datasets.
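The high-level loop the abstract describes can be sketched as follows: at each step, a waypoint predictor proposes candidate waypoints from the current panoramic observation, the agent grounds the instruction to one candidate, and the environment executes the move with low-level controls. This is a minimal illustrative sketch, not the paper's actual implementation; all names here (`WaypointPredictor`, `VLNAgent`, `DummyEnv`, `navigate`) are hypothetical stand-ins, and the predictor and agent are stubbed with trivial logic.

```python
# Hypothetical sketch of waypoint-based VLN in a continuous environment.
# The real system uses a trained waypoint predictor and a cross-modal
# agent; here both are stubbed so the control flow is runnable.
from dataclasses import dataclass
from typing import List


@dataclass
class Waypoint:
    heading: float   # relative heading in radians
    distance: float  # distance from the agent in meters


class WaypointPredictor:
    """Stand-in for the trained predictor: returns a fixed fan of
    candidate waypoints around the agent at each time step."""
    def predict(self, panorama) -> List[Waypoint]:
        return [Waypoint(heading=h, distance=2.0)
                for h in (-1.0, -0.5, 0.0, 0.5, 1.0)]


class VLNAgent:
    """Stand-in agent: a real agent would ground the instruction to each
    candidate's view; this stub deterministically picks straight ahead."""
    def select(self, instruction: str, panorama,
               candidates: List[Waypoint]) -> Waypoint:
        return min(candidates, key=lambda w: abs(w.heading))


class DummyEnv:
    """Minimal stand-in environment that finishes after three moves.
    A real continuous environment would execute low-level controls
    (turn, step forward) to reach the chosen waypoint."""
    def __init__(self):
        self.steps = 0

    def observe(self):
        return None  # placeholder for a panoramic observation

    def move_to(self, waypoint: Waypoint):
        self.steps += 1

    def done(self) -> bool:
        return self.steps >= 3


def navigate(instruction: str, env, predictor, agent,
             max_steps: int = 15) -> List[Waypoint]:
    """High-level navigation loop: predict waypoints, select one by
    grounding the instruction, move, repeat until done."""
    trajectory = []
    for _ in range(max_steps):
        panorama = env.observe()
        candidates = predictor.predict(panorama)
        target = agent.select(instruction, panorama, candidates)
        env.move_to(target)  # low-level controls happen inside the env
        trajectory.append(target)
        if env.done():
            break
    return trajectory
```

The key design point mirrored here is that the agent never reasons about low-level controls: it only chooses among predicted waypoints, which is what lets an agent designed for discrete, graph-based navigation be transferred to continuous environments.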