Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

Paper Title

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog

Authors

Li, Zekang; Li, Zongjia; Zhang, Jinchao; Feng, Yang; Niu, Cheng; Zhou, Jie

Abstract

Audio-Visual Scene-Aware Dialog (AVSD) is the task of generating responses when chatting about a given video; it is organized as a track of the 8th Dialog System Technology Challenge (DSTC8). To solve this task, we propose a universal multimodal transformer and introduce a multi-task learning method to learn joint representations among different modalities as well as to generate informative and fluent responses. Our method extends a pre-trained natural language generation model to the multimodal dialogue generation task. Our system achieves the best performance in both the objective and subjective evaluations in the challenge.
