Paper Title

Identity-Aware Multi-Sentence Video Description

Authors

Jae Sung Park, Trevor Darrell, Anna Rohrbach

Abstract

Standard video and movie description tasks abstract away from person identities, thus failing to link identities across sentences. We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires re-identifying persons locally within a set of consecutive clips. We introduce an auxiliary task of Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips, when the video descriptions are given. Our proposed approach to this task leverages a Transformer architecture allowing for coherent joint prediction of multiple IDs. One of the key components is a gender-aware textual representation, as well as an additional gender prediction objective in the main model. This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description. We first generate multi-sentence video descriptions, and then apply our Fill-in the Identity model to establish links between the predicted person entities. To be able to tackle both tasks, we augment the Large Scale Movie Description Challenge (LSMDC) benchmark with new annotations suited for our problem statement. Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works, and allows us to generate descriptions with locally re-identified people.
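To make the Fill-in the Identity task concrete, the sketch below shows its input/output format: captions for consecutive clips with blanked person mentions, and a consistent local-ID assignment where blanks referring to the same person receive the same ID. This is a schematic illustration only, with hypothetical names (`Clip`, `fill_identities`, `predict_same_person`); the paper's actual model is a Transformer making joint predictions, not this pairwise stand-in.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    description: str   # caption with [PERSON] placeholders
    num_blanks: int    # number of blanked person mentions in this caption

def fill_identities(clips, predict_same_person):
    """Assign consistent local IDs to blanked person mentions.

    `predict_same_person(j, i)` is a stand-in for the model's decision
    that blank j and blank i refer to the same person. Returns one ID
    string (e.g. 'PERSON1') per blank, in caption order.
    """
    ids, next_id = [], 1
    total_blanks = sum(c.num_blanks for c in clips)
    for i in range(total_blanks):
        # Link to an earlier blank if the model says it is the same person.
        match = next((ids[j] for j in range(i) if predict_same_person(j, i)), None)
        if match is None:
            match, next_id = f"PERSON{next_id}", next_id + 1
        ids.append(match)
    return ids

clips = [
    Clip("[PERSON] opens the door and waves to [PERSON].", 2),
    Clip("[PERSON] walks inside.", 1),
]

# Toy oracle: blank 0 and blank 2 are the same person.
oracle = lambda j, i: (j, i) == (0, 2)
print(fill_identities(clips, oracle))  # ['PERSON1', 'PERSON2', 'PERSON1']
```

The key property the task demands is exactly what this format captures: identity links are established only locally, within the given set of consecutive clips, rather than against a global cast list.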
