Paper Title

Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with Natural Language

Authors

Songyang Zhang, Houwen Peng, Jianlong Fu, Yijuan Lu, Jiebo Luo

Abstract

We address the problem of retrieving a specific moment from an untrimmed video by natural language. It is a challenging problem because a target moment may take place in the context of other temporal moments in the untrimmed video. Existing methods cannot tackle this challenge well since they do not fully consider the temporal contexts between temporal moments. In this paper, we model the temporal context between video moments by a set of predefined two-dimensional maps under different temporal scales. For each map, one dimension indicates the starting time of a moment and the other indicates the duration. These 2D temporal maps can cover diverse video moments with different lengths, while representing their adjacent contexts at different temporal scales. Based on the 2D temporal maps, we propose a Multi-Scale Temporal Adjacent Network (MS-2D-TAN), a single-shot framework for moment localization. It is capable of encoding the adjacent temporal contexts at each scale, while learning discriminative features for matching video moments with referring expressions. We evaluate the proposed MS-2D-TAN on three challenging benchmarks, i.e., Charades-STA, ActivityNet Captions, and TACoS, where our MS-2D-TAN outperforms the state of the art.
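To make the 2D temporal map concrete, here is a minimal sketch of how candidate moments can be laid out on such a map: one axis indexes a moment's starting clip and the other its duration, and each valid cell holds a pooled feature for that span. This is only an illustration of the indexing scheme described in the abstract, not the paper's implementation; the `build_2d_temporal_map` function, the max-pooling choice, and the toy feature sizes are assumptions for the example.

```python
import numpy as np

def build_2d_temporal_map(clip_feats):
    """Sketch of a 2D temporal map of candidate moments.

    clip_feats: (N, D) array with one feature vector per video clip.
    Returns an (N, N, D) map where entry [i, j] represents the moment
    that starts at clip i and lasts j+1 clips (features max-pooled
    over the span). Cells with i + j >= N fall outside the video and
    are left as zeros (the invalid lower triangle of the map).
    """
    N, D = clip_feats.shape
    tmap = np.zeros((N, N, D), dtype=clip_feats.dtype)
    for i in range(N):           # start-time index
        for j in range(N - i):   # duration index (duration = j + 1 clips)
            # Max-pool the clip features over the span [i, i + j].
            tmap[i, j] = clip_feats[i:i + j + 1].max(axis=0)
    return tmap

# Toy example: a video split into 8 clips with 16-dim features.
feats = np.random.rand(8, 16).astype(np.float32)
tmap = build_2d_temporal_map(feats)
print(tmap.shape)  # (8, 8, 16)
```

In the multi-scale variant, several such maps are built at different temporal granularities (coarser clips for longer moments), so short and long candidates are each represented alongside their adjacent moments at an appropriate scale.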
