Title
Dual-Stream Transformer for Generic Event Boundary Captioning
Authors
Abstract
This paper describes our champion solution for the CVPR 2022 Generic Event Boundary Captioning (GEBC) competition. GEBC requires the captioning model to comprehend the instantaneous status changes around a given video boundary, which makes it much more challenging than the conventional video captioning task. We propose a Dual-Stream Transformer with improvements to both video content encoding and caption generation: (1) We use three pre-trained models to extract video features at different granularities, and we exploit the boundary type as a hint to help the model generate captions. (2) We design a model, termed Dual-Stream Transformer, to learn discriminative representations for boundary captioning. (3) To generate content-relevant and human-like captions, we improve description quality with a word-level ensemble strategy. The promising results on the GEBC test split demonstrate the efficacy of the proposed model.
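The abstract does not spell out how the word-level ensemble works. As a rough illustration only, one common reading is that several captioning models are combined at every decoding step by averaging their next-word distributions, rather than by choosing among finished captions. The sketch below assumes that reading. The `NextTokenModel` interface, the uniform weights, the greedy decoding, and the toy models `model_a` and `model_b` are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical interface: a "model" maps a token prefix to a
# next-token probability distribution. Not the paper's actual API.
NextTokenModel = Callable[[Sequence[str]], Dict[str, float]]


def ensemble_step(
    models: Sequence[NextTokenModel],
    prefix: Sequence[str],
    weights: Optional[Sequence[float]] = None,
) -> Dict[str, float]:
    """Combine the next-token distributions of all models by weighted average."""
    if weights is None:
        # Assumption: uniform weighting across ensemble members.
        weights = [1.0 / len(models)] * len(models)
    combined: Dict[str, float] = {}
    for model, weight in zip(models, weights):
        for token, prob in model(prefix).items():
            combined[token] = combined.get(token, 0.0) + weight * prob
    return combined


def greedy_decode(
    models: Sequence[NextTokenModel], max_len: int = 20, eos: str = "<eos>"
) -> List[str]:
    """Decode one caption, picking the highest-scoring ensembled word per step."""
    prefix: List[str] = []
    for _ in range(max_len):
        dist = ensemble_step(models, prefix)
        token = max(dist, key=dist.get)
        if token == eos:
            break
        prefix.append(token)
    return prefix


# Toy stand-ins for trained captioners (illustrative only).
def model_a(prefix: Sequence[str]) -> Dict[str, float]:
    table = {0: {"man": 0.6, "dog": 0.4}, 1: {"jumps": 0.7, "runs": 0.3}}
    return table.get(len(prefix), {"<eos>": 1.0})


def model_b(prefix: Sequence[str]) -> Dict[str, float]:
    table = {0: {"man": 0.3, "dog": 0.7}, 1: {"jumps": 0.2, "runs": 0.8}}
    return table.get(len(prefix), {"<eos>": 1.0})


caption = greedy_decode([model_a, model_b])
```

In this toy setup the two models disagree on every word, and the averaged distribution decides each step independently. That is the practical difference from a caption-level ensemble, which would have to pick one model's whole output.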