Paper Title

Global2Local: A Joint-Hierarchical Attention for Video Captioning

Paper Authors

Chengpeng Dai, Fuhai Chen, Xiaoshuai Sun, Rongrong Ji, Qixiang Ye, Yongjian Wu

Paper Abstract

Recently, automatic video captioning has attracted increasing attention, where the core challenge lies in capturing the key semantic items, such as objects and actions, as well as their spatio-temporal correlations, from redundant frames and semantic content. To this end, existing works select either key video clips at the global level (across multiple frames) or key regions within each frame, which, however, neglects the hierarchical order, i.e., key frames first and key regions second. In this paper, we propose a novel joint-hierarchical attention model for video captioning, which embeds the key clips, the key frames, and the key regions jointly into the captioning model in a hierarchical manner. This joint-hierarchical attention model first conducts a global selection to identify key frames, followed by a Gumbel sampling operation that further identifies key regions within those frames, achieving an accurate global-to-local feature representation to guide captioning. Extensive quantitative evaluations on two public benchmark datasets, MSVD and MSR-VTT, demonstrate the superiority of the proposed method over state-of-the-art methods.
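
As a concrete illustration of the global-to-local selection described in the abstract, below is a minimal PyTorch sketch, assuming soft attention over frames for the global step and a straight-through Gumbel-Softmax for the hard local region selection. The layer name, scoring functions, and tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Global2LocalAttentionSketch(nn.Module):
    """Hypothetical global-to-local selection layer (not the authors' code):
    soft attention over frames (global step), then a hard straight-through
    Gumbel-Softmax sample over each frame's regions (local step)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.frame_score = nn.Linear(feat_dim, 1)   # scores each frame globally
        self.region_score = nn.Linear(feat_dim, 1)  # scores each region locally

    def forward(self, frame_feats, region_feats, tau: float = 1.0):
        # frame_feats:  (B, T, D)    one feature vector per frame
        # region_feats: (B, T, R, D) R region features per frame
        frame_attn = F.softmax(self.frame_score(frame_feats).squeeze(-1), dim=-1)  # (B, T)

        # Hard one-hot region selection in the forward pass; gradients flow
        # through the relaxed sample (straight-through Gumbel-Softmax).
        region_logits = self.region_score(region_feats).squeeze(-1)                # (B, T, R)
        region_onehot = F.gumbel_softmax(region_logits, tau=tau, hard=True, dim=-1)
        key_regions = (region_onehot.unsqueeze(-1) * region_feats).sum(dim=2)      # (B, T, D)

        # Weight the sampled key regions by the global frame attention to get
        # one global-to-local context vector per video.
        return (frame_attn.unsqueeze(-1) * key_regions).sum(dim=1)                 # (B, D)

# Example: 2 videos, 8 frames, 36 regions per frame, 512-d features.
attn = Global2LocalAttentionSketch(feat_dim=512)
context = attn(torch.randn(2, 8, 512), torch.randn(2, 8, 36, 512))
print(context.shape)  # torch.Size([2, 512])
```

The `hard=True` flag makes the forward pass commit to one region per frame while gradients pass through the relaxed sample, which is the standard straight-through Gumbel-Softmax estimator and matches the role the abstract assigns to Gumbel sampling for key-region selection.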
