Paper Title
Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment
Authors
Abstract
Multimedia summarization with multimodal output (MSMO) is a recently explored application in language grounding. It plays an essential role in real-world applications such as automatically generating cover images and titles for news articles or providing introductions to online videos. However, existing methods extract features from the whole video and the whole article and use fusion methods to select representative ones, and thus usually ignore the critical structure and varying semantics. In this work, we propose a Semantics-Consistent Cross-domain Summarization (SCCS) model based on optimal transport alignment with visual and textual segmentation. Specifically, our method first decomposes both the video and the article into segments in order to capture the structural semantics of each. SCCS then follows a cross-domain alignment objective with optimal transport distance, which leverages multimodal interaction to match and select the visual and textual summaries. We evaluate our method on three recent multimodal datasets and demonstrate its effectiveness in producing high-quality multimodal summaries.
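To make the idea of aligning visual and textual segments by optimal transport concrete, here is a minimal sketch. It is not the authors' implementation. The abstract does not say how the optimal transport distance is computed, so the sketch assumes entropy-regularized transport solved with Sinkhorn iterations, a cosine cost between segment embeddings, and uniform segment weights. The embeddings are hypothetical toy values, and the function names (cosine_cost, sinkhorn, ot_distance) are illustrative only.

```python
import math

def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cosine similarity between xs[i] and ys[j]."""
    def norm(v):
        return math.sqrt(sum(t * t for t in v))
    return [[1.0 - sum(p * q for p, q in zip(x, y)) / (norm(x) * norm(y))
             for y in ys] for x in xs]

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan between weight vectors a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel: low-cost pairs receive large kernel values.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals a and b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_distance(cost, plan):
    """Total transport cost <C, P> under the given plan."""
    return sum(c * p for c_row, p_row in zip(cost, plan)
               for c, p in zip(c_row, p_row))

# Hypothetical 2-d embeddings: three video segments and two text segments.
video_segments = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_segments = [[0.9, 0.1], [0.1, 0.9]]

cost = cosine_cost(video_segments, text_segments)
a = [1.0 / len(video_segments)] * len(video_segments)  # uniform video weights (assumption)
b = [1.0 / len(text_segments)] * len(text_segments)    # uniform text weights (assumption)
plan = sinkhorn(cost, a, b)
distance = ot_distance(cost, plan)

# For each video segment, the text segment that receives most of its mass.
best_text = [max(range(len(row)), key=row.__getitem__) for row in plan]
```

In this sketch the transport plan acts as a soft matching between the two modalities. Its largest entries indicate which visual and textual segments agree semantically, and the scalar distance could serve as an alignment loss when the embeddings are learned.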