Paper Title
Modeling Coreference Relations in Visual Dialog
Paper Authors
Paper Abstract
Visual dialog is a vision-language task in which an agent must answer a series of questions grounded in an image, based on its understanding of both the dialog history and the image. The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering. Most previous works have focused on learning better multi-modal representations or on exploring different ways of fusing visual and language features, while the coreferences in the dialog are largely ignored. In this paper, based on linguistic knowledge and discourse features of human dialog, we propose two soft constraints that improve the model's ability to resolve coreferences in dialog in an unsupervised way. Experimental results on the VisDial v1.0 dataset show that our model, which integrates these two novel and linguistically inspired soft constraints into a deep transformer architecture, obtains new state-of-the-art performance on recall@1 and other evaluation metrics compared to existing models, without pretraining on other vision-language datasets. Our qualitative results further demonstrate the effectiveness of the proposed method.
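To make the abstract's notion of "soft constraints" concrete, the sketch below shows one common way such constraints can be integrated into a transformer-based visual dialog model: as auxiliary penalty terms added to the main answer-ranking loss. This is a minimal, illustrative sketch only; the abstract does not specify the paper's actual formulation, and the names `constraint_a`, `constraint_b` and the weights `lambda_a`, `lambda_b` are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class SoftConstraintLoss(nn.Module):
    """Illustrative pattern: combine a standard answer-ranking loss with two
    auxiliary soft-constraint penalties, each scaled by a hyperparameter.
    The concrete form of the paper's two linguistically inspired constraints
    is not given in the abstract; the penalty inputs are placeholders."""

    def __init__(self, lambda_a: float = 0.1, lambda_b: float = 0.1):
        super().__init__()
        self.lambda_a = lambda_a
        self.lambda_b = lambda_b
        # Main objective: rank the ground-truth answer above the candidates.
        self.ranking_loss = nn.CrossEntropyLoss()

    def forward(self, answer_scores: torch.Tensor, answer_targets: torch.Tensor,
                constraint_a: torch.Tensor, constraint_b: torch.Tensor) -> torch.Tensor:
        # answer_scores: (batch, num_candidates) scores from the transformer encoder
        # answer_targets: (batch,) index of the ground-truth answer
        # constraint_a / constraint_b: scalar penalties computed from model-internal
        # quantities (e.g., attention weights over dialog history tokens)
        main = self.ranking_loss(answer_scores, answer_targets)
        return main + self.lambda_a * constraint_a + self.lambda_b * constraint_b
```

Because the penalties are added to the training objective rather than enforced as hard rules, the constraints remain "soft": they bias the model toward linguistically plausible coreference behavior without requiring any coreference annotations, which is consistent with the unsupervised setting described in the abstract.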