Paper Title
TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
Paper Authors
Paper Abstract
Audio Visual Scene-aware Dialog (AVSD) is the task of generating responses in a discussion about a given video. The previous state-of-the-art model shows superior performance on this task using a Transformer-based architecture. However, there remain some limitations in learning better representations of the modalities. Inspired by Neural Machine Translation (NMT), we propose the Transformer-based Modal Translator (TMT), which learns representations of a source modal sequence by translating it into a related target modal sequence in a supervised manner. Building on Multimodal Transformer Networks (MTN), we apply TMT to the video and dialog modalities, proposing MTN-TMT for the video-grounded dialog system. On the AVSD track of the Dialog System Technology Challenge 7, MTN-TMT outperforms MTN and the other submitted models on both the Video and Text task and the Text Only task. Compared with MTN, MTN-TMT improves on all metrics, achieving a relative improvement of up to 14.1% on CIDEr. Index Terms: multimodal learning, audio-visual scene-aware dialog, neural machine translation, multi-task learning
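The abstract describes TMT as an auxiliary supervised translation objective trained jointly with the dialog-response objective (multi-task learning). A minimal sketch of how such a combined objective can be formed is below; the function name, the per-translator weights, and the specific modality pairs are illustrative assumptions, not the paper's exact formulation.

```python
def multitask_loss(response_loss, translation_losses, weights):
    """Combine the main response-generation loss with weighted auxiliary
    modal-translation losses (e.g. video->caption, caption->video).

    response_loss:      scalar loss of the dialog-response decoder
    translation_losses: list of scalar losses, one per modal translator
    weights:            list of non-negative weights, same length
    """
    assert len(translation_losses) == len(weights)
    # Total objective: main task plus weighted sum of auxiliary tasks.
    return response_loss + sum(w * l for w, l in zip(weights, translation_losses))

# Hypothetical example with two auxiliary translators, each weighted 0.3:
total = multitask_loss(2.0, [0.5, 0.8], [0.3, 0.3])
```

In practice each auxiliary loss would come from a separate Transformer decoder head supervised by the paired target-modality sequence; the sketch only shows how the losses are combined into one training objective.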